L3Cube-MahaNews: News-based Short Text and Long Document Classification Datasets in Marathi
Abstract
The availability of text or topic classification datasets in the low-resource Marathi language is limited, typically consisting of fewer than 4 target labels, with some achieving nearly perfect accuracy. In this work, we introduce L3Cube-MahaNews, a Marathi text classification corpus that focuses on News headlines and articles. This corpus stands out as the largest supervised Marathi Corpus, containing over 1.05L records classified into a diverse range of 12 categories. To accommodate different document lengths, MahaNews comprises three supervised datasets specifically designed for short text, long documents, and medium paragraphs. The consistent labeling across these datasets facilitates document length-based analysis. We provide detailed data statistics and baseline results on these datasets using state-of-the-art pre-trained BERT models. We conduct a comparative analysis between monolingual and multilingual BERT models, including MahaBERT, IndicBERT, and MuRIL. The monolingual MahaBERT model outperforms all others on every dataset. These resources also serve as Marathi topic classification datasets or models and are publicly available at https://github.com/l3cube-pune/MarathiNLP.
Keywords: Marathi Text Classification, Marathi Topic Identification, Low Resource Language, Short Text Classification, Long Document Classification, News Article Datasets, BERT, Web Scraping.
1 Introduction
Text classification is a popular problem in machine learning and natural language processing (NLP) Minaee et al. (2021). It deals with organizing, segregating, and assigning a sentence or a document to predefined categories. It is a supervised learning task and has been solved using traditional machine learning approaches and more recent deep learning algorithms Wagh et al. (2021). Text classification is important for applications like the automatic categorization of web articles or social media comments. While a lot of research has been done on English text classification, low-resource languages like Marathi are still left behind. In this work, we focus on the classification of text in the Marathi language.
Marathi is one of the 22+ Indian languages (https://en.wikipedia.org/wiki/Languages_of_India) among the 7000 languages spoken worldwide (https://en.wikipedia.org/wiki/Lists_of_languages). It is the third most spoken language of India, spoken by over 83 million people across the country, and it ranks 11th in the list of popular languages across the globe (https://en.wikipedia.org/wiki/Marathi_language). Despite being widely spoken, Marathi-specific monolingual NLP resources are still limited in comparison to other natural languages Joshi (2022a). As a result, fewer data resources for machine learning tasks are available for this language, which makes research challenging in this widely used yet low-resource regional language. Most of the available datasets are in Mandarin Chinese, Spanish, English, Arabic, Hindi, and Bengali (https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research), and there are fewer datasets for regional languages like Marathi. The only four publicly available classification datasets are iNLTK headlines Arora (2020), IndicNLP articles Kakwani et al. (2020), MahaHate Patil et al. (2022), and MahaSent Kulkarni et al. (2021). Kulkarni et al. (2022) showed that the IndicNLP News Article dataset achieves near-perfect accuracy (99%), which limits its usability. We therefore need more challenging datasets to evaluate the quality of the models. Also, all of these datasets have at most four target labels. Thus, there is a significant need for Marathi datasets with exhaustive labels, similar to BBC News (https://www.kaggle.com/c/learn-ai-bbc) or AG News (https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset).
Datasets with varying sequence lengths are required as transformer-based classification models are sensitive towards the text length due to their self-attention operation Beltagy et al. (2020). Models like LongFormer Beltagy et al. (2020) are specifically developed for datasets having longer sequences. In order to develop these models for Marathi we first need such target datasets. Thus, we present L3Cube-MahaNews - A Marathi News Classification Dataset in this paper.
The dataset we propose is available in three forms, viz. short, medium, and long text classification, all obtained from a renowned Marathi news website. This massive corpus of over 1 lakh records will serve as an excellent data source for the comparison of different machine-learning algorithms in low-resource settings. It covers 12 categories spanning diverse disciplines. We evaluate different monolingual and multilingual BERT Devlin et al. (2018) models like MahaBERT Joshi (2022a), IndicBERT Kakwani et al. (2020), and MuRIL Khanuja et al. (2021) and provide baseline results for future studies. The results are evaluated using validation accuracy, testing accuracy, F1 score (Macro), recall (Macro), and precision (Macro).
The main contributions of this work are as follows:
- We present L3Cube-MahaNews, the first extensive document classification dataset in Marathi with 12 target labels. The dataset will be released publicly.
- The corpus consists of three sub-datasets MahaNews-SHC, LPC, and LDC for short, medium, and long documents respectively. We provide three different datasets with varying sentence lengths and the same target labels.
- The datasets are benchmarked using state-of-the-art BERT models like MahaBERT, MuRIL, and IndicBERT, with MahaBERT giving the best results. We thus present a comparative analysis of these monolingual and multilingual BERT models for Marathi text. The MahaNews-LDC-BERT (l3cube-pune/marathi-topic-long-doc), MahaNews-LPC-BERT (l3cube-pune/marathi-topic-medium-doc), MahaNews-SHC-BERT (l3cube-pune/marathi-topic-short-doc), and MahaNews-All-BERT (l3cube-pune/marathi-topic-all-doc) models have been released on Hugging Face.
2 Related Work
Text classification is a very popular task in Natural Language Processing. Even though Marathi is a widely spoken language, the lack of proper Marathi datasets for text classification has restricted research on this language. In this section, we review a few of the publicly available Indian language datasets that are used for this task.
Kakwani et al. (2020) curated IndicCorp, a large-scale sentence-level monolingual corpus covering 11 major Indian languages. The IndicNLP News Article dataset, which is part of the IndicNLPSuite (https://github.com/anoopkunchukuttan/indic_nlp_library), consists of news articles in Marathi categorized into 3 classes, viz. sports, entertainment, and lifestyle. The datasets provided by IndicNLP are used to pre-train word embeddings and multilingual models.
Arora (2020) presented iNLTK (https://github.com/goru001/inltk), an open-source NLP library containing pre-trained language models and methods for data augmentation, textual similarity, tokenization, word embeddings, etc. The iNLTK Headlines Corpus - Marathi (https://www.kaggle.com/datasets/disisbig/marathi-news-dataset) is a Marathi news classification dataset provided by iNLTK, containing nearly 12000 news article headlines collected from a Marathi news website. The corpus contains 3 label classes, viz. state (62%), entertainment (27%), and sports (10%).
Eranpurwala et al. (2022) presented a comparative study of Marathi text classification using monolingual and multilingual embeddings. For the experiment, they used a Marathi news headline dataset sourced from Kaggle with 9K examples and three label classes - entertainment, state, and sport. The news article headlines were originally collected from a Marathi news website. Their study also showed that multilingual embeddings yield a 15 percent performance gain compared to traditional monolingual embeddings.
Jain et al. (2020) evaluated and compared the performance of language models on text classification tasks over 3 Indian languages - Hindi, Bengali, and Telugu. For Hindi, they used BBC Hindi News Articles which contains annotated news articles classified into 14 different categories. While for Bengali and Telugu, they used classification datasets provided by Indic-NLP Kakwani et al. (2020). Their result demonstrated that monolingual models perform better for some languages but the improvement attained is marginal at best.
Patil et al. (2022) curated MahaHate, a tweet-based Marathi hate speech detection dataset. The dataset is collected from Twitter and annotated manually. It consists of over 25000 distinct tweets labeled into 4 major classes, i.e. hate, offensive, profane, and no. Deep learning models based on CNN, LSTM, and transformers, including monolingual and multilingual variants of BERT, were used for evaluation.
Kulkarni et al. (2021) offer the first major publicly available Marathi sentiment analysis dataset, L3CubeMahaSent. The dataset is curated using tweets extracted from various Maharashtrian personalities’ Twitter accounts. It consists of 16,000 distinct tweets classified into three classes - positive, negative, and neutral. The authors performed 2-class and 3-class sentiment analysis on their dataset and evaluated baseline classification results using deep learning models - CNN, LSTM, and ULMFiT.
Velankar et al. (2022) conducted a comparative study between monolingual and multilingual BERT models. The standard multilingual models such as mBERT, indicBERT, and xlm-RoBERTa along with monolingual models - MahaBERT, MahaALBERT, and MahaRoBERTa for Marathi Joshi (2022a) were used in this study.
Labels | Description |
---|---|
Auto | Vehicle launches and their reviews |
Bhakti | Horoscope, festivals, spirituality |
Crime | Crimes and accidents in the country |
Education | Educational institutes and their activities |
Fashion | Fashion events, advertisements of fashion products |
Health | Diseases, medicines, and health-related blogs |
International | Happenings around the world |
Manoranjan | Information related to movies, web series, and so on |
Politics | Political incidents in the country |
Sports | Various sports games, awards, sporting events and so on |
Tech | Latest technologies, gadgets and their reviews |
Travel | Travel tips, top destination recommendations, tourism information, etc. |
Labels | SHC & LDC Train | SHC & LDC Test | SHC & LDC Validation | SHC & LDC Total | LPC Train | LPC Test | LPC Validation | LPC Total |
---|---|---|---|---|---|---|---|---|
Auto | 1664 | 209 | 208 | 2081 | 3099 | 388 | 387 | 3874 |
Bhakti | 1386 | 174 | 173 | 1733 | 3664 | 458 | 458 | 4580 |
Crime | 2354 | 295 | 294 | 2943 | 4092 | 512 | 512 | 5116 |
Education | 680 | 86 | 85 | 851 | 1438 | 180 | 180 | 1798 |
Fashion | 1920 | 241 | 240 | 2401 | 874 | 110 | 109 | 1093 |
Health | 1985 | 249 | 248 | 2482 | 6428 | 804 | 803 | 8035 |
International | 2041 | 256 | 255 | 2552 | 4715 | 590 | 589 | 5894 |
Manoranjan | 2986 | 374 | 373 | 3733 | 4825 | 604 | 603 | 6032 |
Politics | 2250 | 282 | 281 | 2813 | 4379 | 548 | 547 | 5474 |
Sports | 1882 | 236 | 235 | 2353 | 5337 | 668 | 667 | 6672 |
Tech | 2111 | 264 | 264 | 2639 | 2049 | 257 | 256 | 2562 |
Travel | 755 | 95 | 94 | 944 | 1970 | 247 | 246 | 2463 |
Total | 22014 | 2761 | 2750 | 27525 | 42870 | 5366 | 5357 | 53593 |
3 Curating the Dataset
We propose L3Cube-MahaNews which is a collection of datasets for short text and long document classification. The Short Headlines Classification (SHC), Long Document Classification (LDC), and Long Paragraph Classification (LPC) datasets are the three supervised datasets included in MahaNews.
- Short Headlines Classification (SHC): This short document classification dataset contains the headlines of news articles along with their corresponding categorical labels.
- Long Paragraph Classification (LPC): This is a medium-length document classification dataset. The news articles are divided into paragraphs, and each record in this dataset contains one paragraph with its corresponding categorical label.
- Long Document Classification (LDC): This long document classification dataset contains records having an entire news article along with its corresponding categorical label.
The categorical labels in the supervised datasets are described in detail in Table 1.
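The relationship between the three datasets can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the `article` dictionary layout, the `build_records` helper, and the `paragraphs_per_article` parameter are hypothetical, and whether the LDC text includes the headline is not stated in the paper (here it does not).

```python
import random


def build_records(article, paragraphs_per_article=2, seed=0):
    """Derive one SHC record, a few LPC records, and one LDC record from an article.

    `article` is assumed to look like:
        {"headline": str, "paragraphs": [str, ...], "label": str}
    """
    label = article["label"]
    paragraphs = article["paragraphs"]

    # LPC: a random subset of paragraphs (the subset size is an assumption).
    rng = random.Random(seed)
    chosen = rng.sample(paragraphs, min(paragraphs_per_article, len(paragraphs)))

    shc = {"text": article["headline"], "label": label}      # headline only
    lpc = [{"text": p, "label": label} for p in chosen]       # one row per paragraph
    ldc = {"text": " ".join(paragraphs), "label": label}      # entire article body
    return shc, lpc, ldc
```

Because all three record types inherit the label of the parent article, the labels stay consistent across SHC, LPC, and LDC, which is what makes length-based comparisons possible.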
3.1 Data Collection
The datasets are compiled from scraped news data. All of the data is taken from the Lokmat website (https://www.lokmat.com/), which hosts news articles in the Marathi language. The data was scraped using the urllib package to handle URL requests and the BeautifulSoup package to extract data from the HTML of each requested URL.
The Lokmat website arranges its news articles under predefined categories like automobile, sports, travel, politics, etc. While scraping, this categorization was preserved and later used as the target labels. The final curated datasets were shuffled, de-duplicated, and cleaned up.
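A minimal sketch of such a collection step is given below. It is not the authors' scraper: to stay dependency-free it uses the standard-library `html.parser` in place of BeautifulSoup, and the page structure (headline in `<h1>`, body text in `<p>`), the helper names, and the record layout are all assumptions made for illustration.

```python
import random
from html.parser import HTMLParser
from urllib.request import urlopen


class ArticleParser(HTMLParser):
    """Collects the text inside <h1> (headline) and <p> (paragraph) tags."""

    def __init__(self):
        super().__init__()
        self.headline = ""
        self.paragraphs = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "h1":
                self.headline = text
            elif text:
                self.paragraphs.append(text)
            self._current = None


def fetch_html(url):
    """Download a page with urllib (shown for completeness, not called here)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_article(html, category):
    """Turn raw HTML into a record labeled with the site's category."""
    parser = ArticleParser()
    parser.feed(html)
    return {"headline": parser.headline,
            "paragraphs": parser.paragraphs,
            "label": category}


def dedupe_and_shuffle(records, seed=0):
    """Drop exact duplicate articles, then shuffle deterministically."""
    seen, unique = set(), []
    for record in records:
        key = (record["headline"], tuple(record["paragraphs"]))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    random.Random(seed).shuffle(unique)
    return unique
```

The category passed to `parse_article` plays the role of the website's predefined section, which is how the target label is preserved during scraping.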
3.2 Data Statistics
The L3Cube-MahaNews has a total of 1,08,643 records which are derived from 27,525 news articles scraped from Lokmat. SHC and LDC have a total of 27,525 rows with labels each and LPC has 53,593 labeled rows in it.
The statistical count of records in SHC, LDC, and LPC can be referred from Figure 2 and the average count of words per record in each proposed dataset can be seen in Figure 2.
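The total above follows directly from the per-dataset row counts. The short check below is only a worked example of that arithmetic; the `indian_grouping` helper is a hypothetical utility added to show how the figure 1,08,643 (about 1.08 lakh) is written in the Indian numbering system.

```python
# Row counts reported for each dataset.
ROWS = {"SHC": 27525, "LDC": 27525, "LPC": 53593}

total = sum(ROWS.values())   # all records across the three datasets
lakh = total / 100_000       # 1 lakh = 100,000


def indian_grouping(n):
    """Format a non-negative integer with Indian digit grouping (3, then 2s)."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])
```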
Dataset | Model | Validation Accuracy | Testing Accuracy | F1 Score (Macro) | Recall (Macro) | Precision (Macro) |
---|---|---|---|---|---|---|
SHC | MahaBERT | 91.418 | 91.163 | 90.230 | 89.700 | 91.047 |
SHC | indicBERT | 90.073 | 89.388 | 88.303 | 87.953 | 88.758 |
SHC | MuRIL | 90.655 | 90.112 | 89.031 | 88.826 | 89.313 |
LDC | MahaBERT | 94.780 | 94.706 | 93.589 | 93.210 | 94.079 |
LDC | indicBERT | 93.642 | 92.627 | 91.340 | 91.217 | 91.511 |
LDC | MuRIL | 93.564 | 93.020 | 92.337 | 92.213 | 92.501 |
LPC | MahaBERT | 88.754 | 86.731 | 84.915 | 83.455 | 87.138 |
LPC | indicBERT | 86.298 | 85.222 | 86.688 | 81.697 | 84.249 |
LPC | MuRIL | 87.157 | 86.582 | 84.585 | 83.215 | 86.603 |
Testing dataset | Training dataset | Testing Accuracy | F1 Score (Macro) | Recall (Macro) | Precision (Macro) |
---|---|---|---|---|---|
SHC | SHC | 91.163 | 90.230 | 89.700 | 91.047 |
SHC | LPC | 73.234 | 73.001 | 75.669 | 77.353 |
SHC | LDC | 74.171 | 79.570 | 76.599 | 79.570 |
SHC | SHC+LPC+LDC | 86.780 | 85.484 | 86.195 | 87.689 |
LPC | SHC | 73.234 | 73.001 | 75.669 | 77.353 |
LPC | LPC | 86.731 | 84.915 | 83.455 | 87.138 |
LPC | LDC | 72.201 | 75.741 | 72.530 | 70.521 |
LPC | SHC+LPC+LDC | 89.713 | 88.421 | 88.545 | 88.439 |
LDC | SHC | 80.314 | 79.042 | 84.109 | 81.521 |
LDC | LPC | 87.294 | 86.511 | 88.559 | 86.424 |
LDC | LDC | 94.706 | 93.589 | 93.210 | 94.079 |
LDC | SHC+LPC+LDC | 87.758 | 86.686 | 87.869 | 91.918 |
4 Evaluation
We fine-tune the monolingual and multilingual BERT models supporting the Marathi language on the curated L3Cube-MahaNews corpus for the text classification task. A dense layer is added on top of the BERT model which maps the [CLS] token embedding to the 12 target labels.
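The classification head described above can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the paper: $h_{\texttt{[CLS]}} \in \mathbb{R}^{d}$ is the final-layer embedding of the [CLS] token, $d$ is the encoder's hidden size, and $W$, $b$ are the weights and bias of the added dense layer. Training with a softmax cross-entropy loss is assumed, as is common for this setup.

```latex
z = W\, h_{\texttt{[CLS]}} + b,
\qquad W \in \mathbb{R}^{12 \times d},\; b \in \mathbb{R}^{12}

p(y = k \mid x) = \frac{\exp(z_k)}{\sum_{j=1}^{12} \exp(z_j)},
\qquad \hat{y} = \arg\max_{k \in \{1,\dots,12\}} z_k
```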
4.1 Experiment setup
4.1.1 Data Preparation
Each of the SHC, LDC, and LPC corpora is split into train, test, and validation datasets in a ratio of 80:10:10. We have ensured that the category-wise distribution of data in SHC, LDC, and LPC remains constant across the split datasets.
The datasets are preprocessed to remove unwanted characters and words such as newline characters, hashtags, URLs, and so on. After preprocessing, only Devanagari characters, English characters, and numerical digits are retained.
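A cleaning step of this kind can be sketched with regular expressions. The patterns below are assumptions rather than the authors' exact rules: URLs and hashtags are dropped whole, and "Devanagari" is taken to mean the Unicode block U+0900 to U+097F (which also contains Devanagari digits and the danda).

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\S+")
# Anything that is not Devanagari, an ASCII letter, an ASCII digit, or a space.
DISALLOWED_RE = re.compile(r"[^\u0900-\u097Fa-zA-Z0-9 ]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs, hashtags, newlines, and characters outside the kept scripts."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = WHITESPACE_RE.sub(" ", text)   # newlines and tabs become spaces
    text = DISALLOWED_RE.sub(" ", text)   # drop punctuation and other scripts
    return WHITESPACE_RE.sub(" ", text).strip()
```

For example, a headline that mixes Marathi text, an English word, a hashtag, and a link keeps only the Marathi words, the English word, and any digits.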
Refer to Table 2 for the category-wise distribution of data into train, test, and validation datasets.
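The 80:10:10 stratified split can be sketched as below. The rounding rule is an assumption, not something the paper states: taking the floor of 80% for training and giving the test set the larger half of the remainder happens to reproduce the per-category counts in Table 2 (e.g. 2081 Auto records become 1664/209/208). The function names and record layout are hypothetical.

```python
import random


def split_counts(n):
    """Return (train, test, validation) sizes for n records of one category."""
    train = n * 8 // 10        # floor of 80%
    rest = n - train
    test = (rest + 1) // 2     # larger half of the remainder
    validation = rest - test
    return train, test, validation


def stratified_split(records, seed=0):
    """Split records per label so every split keeps the category distribution.

    Each record is assumed to be a dict with a "label" key.
    """
    by_label = {}
    for record in records:
        by_label.setdefault(record["label"], []).append(record)

    rng = random.Random(seed)
    splits = {"train": [], "test": [], "validation": []}
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train, n_test, _ = split_counts(len(items))
        splits["train"] += items[:n_train]
        splits["test"] += items[n_train:n_train + n_test]
        splits["validation"] += items[n_train + n_test:]
    return splits
```

Splitting each category separately, rather than the corpus as a whole, is what keeps the class proportions identical in the train, test, and validation sets.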
4.1.2 Models
The pre-trained BERT models that have been finetuned for text classification are as follows:
- MahaBERT (https://huggingface.co/l3cube-pune/marathi-bert-v2): MahaBERT is a multilingual BERT model fine-tuned on 752 million tokens from L3Cube-MahaCorpus and other publicly available Marathi monolingual datasets.
- indicBERT (https://huggingface.co/ai4bharat/indic-bert): IndicBERT is a multilingual ALBERT model exclusively pre-trained on 12 Indian languages. It is pre-trained on the AI4Bharat IndicNLP Corpora of around 9 billion tokens.
- MuRIL (https://huggingface.co/google/muril-base-cased): MuRIL is a BERT model pre-trained on 17 Indian languages. It has been pre-trained on datasets from Wikipedia, Common Crawl, Dakshina, etc.
These models gave the best results when trained for 3 epochs on the training datasets at the default learning rate (1e-3). The MuRIL model, on the other hand, performed best when trained for 5 epochs on the SHC dataset.
The fine-tuned MahaBERT models were also evaluated on the test sets of the other datasets. For example, the MahaBERT model fine-tuned on the SHC dataset was tested on the LDC and LPC test sets. The results of this cross-analysis are reported separately.
4.2 Results
The results obtained from fine-tuning the models on our datasets are shown in Table 3 along with the confusion matrices in Figure 7, 7 and 7.
The results of the cross-analysis, obtained by testing the MahaBERT models on the test sets of different datasets, are shown in Table 4.
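For reference, the reported metrics can be computed from gold and predicted labels as in the sketch below. It uses the usual definition of macro averaging (the unweighted mean of per-class scores); the paper does not say which implementation was used, and the `macro_scores` helper is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []

    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }
```

Macro averaging weights all 12 categories equally, so small categories such as Education and Travel influence the score as much as large ones.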
The key observations that were inferred are as follows:
- The monolingual MahaBERT model outperforms all other models in terms of the various scores depicted in the table for every corpus.
- Among SHC, LDC, and LPC, LDC gave the best results in fine-tuning for the text classification task. This is expected as the long document data contains more information as compared to the other two smaller-length datasets.
- LPC reports scores on the lower side for all the 3 models. A paragraph might at times contain more generic information and hence result in confusion for the models.
- Cross-dataset testing, or zero-shot testing on unseen datasets, reveals that models trained on one dataset don’t generalize well to other test sets. This affirms the need for different datasets with varying text lengths.
- The model trained on all three datasets (SHC + LPC + LDC) provides the best results for LPC but fares poorly for SHC and LDC. This shows that building a single competitive model needs more attention. More samples in the LPC dataset as compared to the other datasets could explain the bias towards LPC. An extensive evaluation of this behavior is left to future scope.
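The cross-dataset observations can be checked directly against the testing accuracies in Table 4. The snippet below is a small worked example, assuming that each group of rows in Table 4 corresponds to one test set and each row within a group to one training set; "ALL" is shorthand introduced here for SHC+LPC+LDC.

```python
# Testing accuracy from Table 4, keyed by (training set, test set).
ACCURACY = {
    ("SHC", "SHC"): 91.163, ("LPC", "SHC"): 73.234,
    ("LDC", "SHC"): 74.171, ("ALL", "SHC"): 86.780,
    ("SHC", "LPC"): 73.234, ("LPC", "LPC"): 86.731,
    ("LDC", "LPC"): 72.201, ("ALL", "LPC"): 89.713,
    ("SHC", "LDC"): 80.314, ("LPC", "LDC"): 87.294,
    ("LDC", "LDC"): 94.706, ("ALL", "LDC"): 87.758,
}


def best_training_set(test_set):
    """Training set that gives the highest accuracy on the given test set."""
    candidates = {train: acc for (train, test), acc in ACCURACY.items()
                  if test == test_set}
    return max(candidates, key=candidates.get)


def in_domain_gap(test_set):
    """In-domain accuracy minus the best single-dataset cross-domain accuracy."""
    in_domain = ACCURACY[(test_set, test_set)]
    cross = max(acc for (train, test), acc in ACCURACY.items()
                if test == test_set and train not in (test_set, "ALL"))
    return round(in_domain - cross, 3)


for name in ("SHC", "LPC", "LDC"):
    print(name, best_training_set(name), in_domain_gap(name))
```

Under this reading of the table, the combined model is the best choice only for the LPC test set, while the SHC and LDC test sets are best served by models trained on the matching dataset.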
5 Conclusion
In this paper, we present L3Cube-MahaNews, a suite of 3 labeled datasets consisting of 1.08L+ Marathi records for Marathi text classification. The paper describes an extensive set of 12 categorical labels used to create the supervised datasets. We have fine-tuned Marathi-capable models to provide a benchmark for future studies and development. The models utilized were MahaBERT, IndicBERT, and MuRIL. We report the best accuracy using MahaBERT for the LDC dataset. We hope that our datasets will play an important role in the betterment of Marathi language support in the field of NLP.
Limitations
During data scraping and preparation, we observed that some news articles had scanned images, GIFs, banner ads, etc., as a part of the web page content. Thus, additional tools (e.g. an OCR image-to-text converter) might be required to extract text from such web content and retain only the news-related textual data in a proper format. Moreover, since the LPC dataset was created by extracting random paragraphs from the parent articles, these paragraphs might at times contain generic information not specific to the target label. In the future, we can manually verify the dataset to filter out such problematic entries.
Acknowledgments
This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement. This work is a part of the L3Cube-MahaNLP project Joshi (2022b).
References
- Arora (2020) Gaurav Arora. 2020. inltk: Natural language toolkit for indic languages. arXiv preprint arXiv:2009.12534.
- Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Eranpurwala et al. (2022) Femida Eranpurwala, Priyanka Ramane, and Bharath Kumar Bolla. 2022. Comparative study of marathi text classification using monolingual and multilingual embeddings. In Advanced Network Technologies and Intelligent Computing: First International Conference, ANTIC 2021, Varanasi, India, December 17–18, 2021, Proceedings, pages 441–452. Springer.
- Jain et al. (2020) Kushal Jain, Adwait Deshpande, Kumar Shridhar, Felix Laumann, and Ayushman Dash. 2020. Indic-transformers: An analysis of transformer language models for indian languages. arXiv preprint arXiv:2011.02323.
- Joshi (2022a) Raviraj Joshi. 2022a. L3cube-mahacorpus and mahabert: Marathi monolingual corpus, marathi bert language models, and resources. In LREC 2022 Workshop Language Resources and Evaluation Conference 20-25 June 2022, page 97.
- Joshi (2022b) Raviraj Joshi. 2022b. L3cube-mahanlp: Marathi natural language processing datasets, models, and library. arXiv preprint arXiv:2205.14728.
- Kakwani et al. (2020) Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, NC Gokul, Avik Bhattacharyya, Mitesh M Khapra, and Pratyush Kumar. 2020. Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961.
- Khanuja et al. (2021) Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, et al. 2021. Muril: Multilingual representations for indian languages. arXiv preprint arXiv:2103.10730.
- Kulkarni et al. (2022) Atharva Kulkarni, Meet Mandhane, Manali Likhitkar, Gayatri Kshirsagar, Jayashree Jagdale, and Raviraj Joshi. 2022. Experimental evaluation of deep learning models for marathi text classification. In Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2021, pages 605–613. Springer.
- Kulkarni et al. (2021) Atharva Kulkarni, Meet Mandhane, Manali Likhitkar, Gayatri Kshirsagar, and Raviraj Joshi. 2021. L3cubemahasent: A marathi tweet-based sentiment analysis dataset. In Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 213–220.
- Minaee et al. (2021) Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. Deep learning–based text classification: a comprehensive review. ACM computing surveys (CSUR), 54(3):1–40.
- Patil et al. (2022) Hrushikesh Patil, Abhishek Velankar, and Raviraj Joshi. 2022. L3cube-mahahate: A tweet-based marathi hate speech detection dataset and bert models. In Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), pages 1–9.
- Velankar et al. (2022) Abhishek Velankar, Hrushikesh Patil, and Raviraj Joshi. 2022. Mono vs multilingual bert for hate speech detection and text classification: A case study in marathi. In Artificial Neural Networks in Pattern Recognition: 10th IAPR TC3 Workshop, ANNPR 2022, Dubai, United Arab Emirates, November 24–26, 2022, Proceedings, pages 121–128. Springer.
- Wagh et al. (2021) Vedangi Wagh, Snehal Khandve, Isha Joshi, Apurva Wani, Geetanjali Kale, and Raviraj Joshi. 2021. Comparative study of long document classification. In TENCON 2021-2021 IEEE Region 10 Conference (TENCON), pages 732–737. IEEE.