L3Cube-MahaNews: News-based Short Text and Long Document Classification Datasets in Marathi

Saloni Mittal^1,3, Vidula Magdum^1,3, Omkar Dhekane^1,3, Sharayu Hiwarkhedkar^1,3
Raviraj Joshi^2,3
¹ Pune Institute of Computer Technology, Pune, Maharashtra India
² Indian Institute of Technology Madras, Chennai, Tamil Nadu India
³ L3Cube Labs, Pune
{salonimittal12, vidulamagdum12, omkarjd1212, hiwarkhedkarsharayu}@gmail.com
[email protected]

Abstract

The availability of text or topic classification datasets in the low-resource Marathi language is limited. Existing datasets typically have fewer than 4 target labels, and some already yield nearly perfect accuracy. In this work, we introduce L3Cube-MahaNews, a Marathi text classification corpus that focuses on news headlines and articles. It is the largest supervised Marathi corpus, containing over 1.08 lakh (108,000) records classified into 12 diverse categories. To accommodate different document lengths, MahaNews comprises three supervised datasets designed for short text, long documents, and medium paragraphs. The consistent labeling across these datasets facilitates document length-based analysis. We provide detailed data statistics and baseline results on these datasets using state-of-the-art pre-trained BERT models. We conduct a comparative analysis between monolingual and multilingual BERT models, including MahaBERT, IndicBERT, and MuRIL. The monolingual MahaBERT model outperforms all others on every dataset. These resources also serve as Marathi topic classification datasets or models and are publicly available at https://github.com/l3cube-pune/MarathiNLP.

Keywords: Marathi Text Classification, Marathi Topic Identification, Low Resource Language, Short Text Classification, Long Document Classification, News Article Datasets, BERT, Web Scraping.

1 Introduction

Text classification is a widely studied problem in machine learning and natural language processing (NLP) Minaee et al. (2021). It deals with organizing, segregating, and appropriately assigning a sentence or a document to predefined categories. It is a supervised learning task that has been solved using both traditional machine learning approaches and more recent deep learning algorithms Wagh et al. (2021). Text classification is important for applications like the automatic categorization of web articles or social media comments. While a lot of research has been done in the area of English text classification, low-resource languages like Marathi are still left behind. In this work, we focus on the classification of text in the Marathi language.

Marathi is one of the 22+ Indian languages (https://en.wikipedia.org/wiki/Languages_of_India) among the roughly 7000 languages spoken worldwide (https://en.wikipedia.org/wiki/Lists_of_languages). It is the third most spoken language of India, spoken by over 83 million people across the country, and ranks 11th in the list of popular languages across the globe (https://en.wikipedia.org/wiki/Marathi_language). Despite being widely spoken, Marathi still has limited monolingual NLP resources in comparison to other natural languages Joshi (2022a). As a result, data resources for machine learning tasks are scarce, which makes research on this widely used but low-resource regional language challenging. Most available datasets are in Mandarin Chinese, Spanish, English, Arabic, Hindi, and Bengali (https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research), and far fewer exist for regional languages like Marathi. The only four publicly available Marathi classification datasets are iNLTK headlines Arora (2020), IndicNLP articles Kakwani et al. (2020), MahaHate Patil et al. (2022), and MahaSent Kulkarni et al. (2021). Kulkarni et al. (2022) showed that the IndicNLP News Article dataset achieves near-perfect accuracy (99%), which limits its usefulness as a benchmark. We therefore need more challenging datasets to evaluate the quality of the models. Also, all of these datasets have at most four target labels. Thus, there is a significant need for Marathi datasets with a broad set of labels, similar to BBC News (https://www.kaggle.com/c/learn-ai-bbc) or AG News (https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset).

Datasets with varying sequence lengths are required as transformer-based classification models are sensitive towards the text length due to their self-attention operation Beltagy et al. (2020). Models like LongFormer Beltagy et al. (2020) are specifically developed for datasets having longer sequences. In order to develop these models for Marathi we first need such target datasets. Thus, we present L3Cube-MahaNews - A Marathi News Classification Dataset in this paper.

The dataset we propose is available in three forms, viz. short, medium, and long text classification, and is obtained from a renowned Marathi news website. This corpus of over 1 lakh records will serve as a data source for comparing different machine-learning algorithms in low-resource settings. It covers 12 categories spanning diverse topics. We evaluate different monolingual and multilingual BERT Devlin et al. (2018) models like MahaBERT Joshi (2022a), IndicBERT Kakwani et al. (2020), and MuRIL Khanuja et al. (2021) and provide baseline results for future studies. The results are evaluated using validation accuracy, testing accuracy, F1 score (macro), recall (macro), and precision (macro).

The main contributions of this work are as follows:

• We present L3Cube-MahaNews, the first extensive document classification dataset in Marathi with 12 target labels. The dataset will be released publicly.
• The corpus consists of three sub-datasets, MahaNews-SHC, MahaNews-LPC, and MahaNews-LDC, for short, medium, and long documents respectively. The three datasets vary in text length but share the same target labels.
• The datasets are benchmarked using state-of-the-art BERT models like MahaBERT, MuRIL, and IndicBERT, with MahaBERT giving the best results. We thus present a comparative analysis of these monolingual and multilingual BERT models for Marathi text. The MahaNews-LDC-BERT (l3cube-pune/marathi-topic-long-doc), MahaNews-LPC-BERT (l3cube-pune/marathi-topic-medium-doc), MahaNews-SHC-BERT (l3cube-pune/marathi-topic-short-doc), and MahaNews-All-BERT (l3cube-pune/marathi-topic-all-doc) models have been released on Hugging Face.

2 Related Work

Text classification is a very popular task in Natural Language Processing. Even though Marathi is a widely spoken language, the lack of suitable Marathi datasets for text classification has restricted research on this language. In this section, we review a few of the publicly available Indian language datasets that are used for this task.

Kakwani et al. (2020) curated IndicCorp, a large-scale sentence-level monolingual corpus collection covering 11 major Indian languages. The IndicNLP News Article dataset, part of the IndicNLPSuite (https://github.com/anoopkunchukuttan/indic_nlp_library), consists of news articles in Marathi categorized into 3 classes, viz. sports, entertainment, and lifestyle. The datasets provided by IndicNLP are used to pre-train word embeddings and multilingual models.

Arora (2020) presented iNLTK (https://github.com/goru001/inltk), an open-source NLP library containing pre-trained language models and methods for data augmentation, textual similarity, tokenization, word embeddings, etc. The iNLTK Headlines Corpus - Marathi (https://www.kaggle.com/datasets/disisbig/marathi-news-dataset) is a Marathi news classification dataset provided by iNLTK, containing nearly 12000 news article headlines collected from a Marathi news website. The corpus contains 3 label classes, viz. state (62%), entertainment (27%), and sports (10%).

Eranpurwala et al. (2022) presented a comparative study of Marathi text classification using monolingual and multilingual embeddings. For their experiments, they used a Marathi news headline dataset sourced from Kaggle with 9K examples and three label classes: entertainment, state, and sport. The news article headlines were originally collected from a Marathi news website. Their study showed that multilingual embeddings yield a 15 percent performance gain over traditional monolingual embeddings.

Jain et al. (2020) evaluated and compared the performance of language models on text classification tasks over 3 Indian languages: Hindi, Bengali, and Telugu. For Hindi, they used BBC Hindi News Articles, which contains annotated news articles classified into 14 different categories. For Bengali and Telugu, they used the classification datasets provided by Indic-NLP Kakwani et al. (2020). Their results demonstrated that monolingual models perform better for some languages, but the improvement attained is marginal at best.

Patil et al. (2022) curated MahaHate, a tweet-based Marathi hate speech detection dataset. The dataset is collected from Twitter and annotated manually. It consists of over 25000 distinct tweets labeled into 4 major classes, i.e. hate, offensive, profane, and not. Deep learning models based on CNN, LSTM, and transformers, including monolingual and multilingual variants of BERT, were used for evaluation.

Kulkarni et al. (2021) presented L3CubeMahaSent, the first major publicly available Marathi sentiment analysis dataset. The dataset is curated using tweets extracted from the Twitter accounts of various Maharashtrian personalities. It consists of 16,000 distinct tweets classified into three classes: positive, negative, and neutral. The authors performed 2-class and 3-class sentiment analysis on their dataset and evaluated baseline classification results using the deep learning models CNN, LSTM, and ULMFiT.

Velankar et al. (2022) conducted a comparative study between monolingual and multilingual BERT models. The standard multilingual models such as mBERT, indicBERT, and xlm-RoBERTa along with monolingual models - MahaBERT, MahaALBERT, and MahaRoBERTa for Marathi Joshi (2022a) were used in this study.

Labels	Description
Auto	Vehicle launches and their reviews
Bhakti	Horoscope, festivals, spirituality
Crime	Crimes and accidents in the country
Education	Educational institutes and their activities
Fashion	Fashion events, advertisements of fashion products
Health	Diseases, medicines, and health-related blogs
International	Happenings around the world
Manoranjan	Information related to movies, web series, and so on
Politics	Political incidents in the country
Sports	Various sports games, awards, sporting events and so on
Tech	Latest technologies, gadgets and their reviews
Travel	Travel tips, top destination recommendations, tourism information, etc.

Table 1: Categorical labels for MahaNews datasets

Labels	SHC & LDC				LPC
Labels	Train	Test	Validation	Total	Train	Test	Validation	Total
Auto	1664	209	208	2081	3099	388	387	3874
Bhakti	1386	174	173	1733	3664	458	458	4580
Crime	2354	295	294	2943	4092	512	512	5116
Education	680	86	85	851	1438	180	180	1798
Fashion	1920	241	240	2401	874	110	109	1093
Health	1985	249	248	2482	6428	804	803	8035
International	2041	256	255	2552	4715	590	589	5894
Manoranjan	2986	374	373	3733	4825	604	603	6032
Politics	2250	282	281	2813	4379	548	547	5474
Sports	1882	236	235	2353	5337	668	667	6672
Tech	2111	264	264	2639	2049	257	256	2562
Travel	755	95	94	944	1970	247	246	2463
Total	22014	2761	2750	27525	42870	5366	5357	53593

Table 2: Category-wise distribution of the SHC, LDC, and LPC datasets into train, test, and validation sets in a ratio of 80:10:10.

3 Curating the Dataset

We propose L3Cube-MahaNews which is a collection of datasets for short text and long document classification. The Short Headlines Classification (SHC), Long Document Classification (LDC), and Long Paragraph Classification (LPC) datasets are the three supervised datasets included in MahaNews.

• Short Headlines Classification (SHC): This short document classification dataset contains the headlines of news articles along with their corresponding categorical labels.
• Long Paragraph Classification (LPC): This medium-length document classification dataset is built by dividing the news articles into paragraphs; each record contains one paragraph with its corresponding categorical label.
• Long Document Classification (LDC): This long document classification dataset contains records having an entire news article along with its corresponding categorical label.
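
The relationship between the three datasets can be illustrated with a minimal sketch that derives SHC, LPC, and LDC records from a single labeled article. The function name `build_records`, the record layout, and the blank-line paragraph delimiter are assumptions made for illustration. This sketch also keeps every paragraph, whereas the Limitations section notes that LPC was built from randomly extracted paragraphs.

```python
from typing import Dict, List


def build_records(headline: str, article: str, label: str) -> Dict[str, List[dict]]:
    """Derive SHC, LPC, and LDC records from one labeled news article."""
    # Assumption: paragraphs are separated by blank lines in the scraped text.
    paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]
    return {
        # SHC: the headline only.
        "SHC": [{"text": headline.strip(), "label": label}],
        # LPC: one record per paragraph, all sharing the article's label.
        "LPC": [{"text": p, "label": label} for p in paragraphs],
        # LDC: the entire article as a single record.
        "LDC": [{"text": " ".join(paragraphs), "label": label}],
    }


example = build_records(
    headline="नवीन इलेक्ट्रिक कार बाजारात",
    article="पहिला परिच्छेद.\n\nदुसरा परिच्छेद.",
    label="Auto",
)
```

Because every record inherits the label of its parent article, the three datasets stay label-consistent, which is what makes the length-based comparison possible.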

The categorical labels in the supervised datasets are described in detail in Table 1.

Figure 1: Statistical count of records in SHC, LDC, and LPC

3.1 Data Collection

The datasets are compiled using scraped news data. All of the data is taken from the Lokmat website (https://www.lokmat.com/), which hosts news articles in the Marathi language. The data was scraped using the urllib package to handle URL requests and the BeautifulSoup package to extract data from the HTML of the requested URL.
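
A minimal sketch of this extraction step is given below. To stay dependency-free it swaps BeautifulSoup for the standard-library `html.parser`, so it is not the authors' scraper. The assumption that the headline sits in an `<h1>` tag and the body in `<p>` tags is hypothetical and does not describe Lokmat's actual markup. The names `ParagraphExtractor`, `fetch_html`, and `parse_article` are illustrative.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphExtractor(HTMLParser):
    """Collects the text inside <h1> (headline) and <p> (body) tags."""

    def __init__(self) -> None:
        super().__init__()
        self.headline = ""
        self.paragraphs = []
        self._tag = None    # tag currently being read, if any
        self._buffer = []   # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if tag == "h1":
                self.headline = text
            elif text:
                self.paragraphs.append(text)
            self._tag = None


def fetch_html(url: str) -> str:
    """Download a page with urllib (not called below; needs network access)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_article(html: str):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.headline, parser.paragraphs


sample = (
    "<html><body><h1>शीर्षक</h1><p>पहिला  परिच्छेद.</p><p> </p>"
    "<p>दुसरा <b>परिच्छेद</b>.</p></body></html>"
)
headline, paragraphs = parse_article(sample)
```

Empty paragraphs are skipped, and inline tags such as `<b>` are flattened into the surrounding paragraph text.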

The Lokmat website arranges its news articles under predefined categories like automobile, sports, travel, politics, etc. While scraping, this categorization was preserved and then used as the target labels. The final curated datasets were shuffled, de-duplicated, and cleaned up.
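
The de-duplication and shuffling step can be sketched as follows. The exact duplicate criterion used by the authors is not stated, so this sketch assumes that two records are duplicates when their whitespace-normalized texts are identical. The function name `clean_records` and the fixed seed are illustrative choices.

```python
import random


def clean_records(records, seed=42):
    """Drop empty and exact-duplicate texts (keeping the first), then shuffle."""
    seen = set()
    unique = []
    for record in records:
        # Normalize whitespace before comparing texts.
        key = " ".join(record["text"].split())
        if key and key not in seen:
            seen.add(key)
            unique.append({"text": key, "label": record["label"]})
    # A seeded generator keeps the shuffle reproducible.
    random.Random(seed).shuffle(unique)
    return unique


raw = [
    {"text": "बातमी एक", "label": "Sports"},
    {"text": "बातमी  एक ", "label": "Sports"},   # duplicate after normalization
    {"text": "बातमी दोन", "label": "Politics"},
    {"text": "   ", "label": "Tech"},            # empty after normalization
]
cleaned = clean_records(raw)
```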

3.2 Data Statistics

L3Cube-MahaNews has a total of 1,08,643 records, derived from 27,525 news articles scraped from Lokmat. SHC and LDC each have 27,525 labeled rows, and LPC has 53,593 labeled rows.

The statistical count of records in SHC, LDC, and LPC is shown in Figure 1, and the average count of words per record in each proposed dataset is shown in Figure 2.

The category-wise percentage distribution for the corpora is shown in Figures 3 and 4.

Dataset	Model	Validation Accuracy	Testing Accuracy	F1 Score (Macro)	Recall (Macro)	Precision (Macro)
SHC	MahaBERT	91.418	91.163	90.230	89.700	91.047
SHC	indicBERT	90.073	89.388	88.303	87.953	88.758
SHC	MuRIL	90.655	90.112	89.031	88.826	89.313
LDC	MahaBERT	94.780	94.706	93.589	93.210	94.079
LDC	indicBERT	93.642	92.627	91.340	91.217	91.511
LDC	MuRIL	93.564	93.020	92.337	92.213	92.501
LPC	MahaBERT	88.754	86.731	84.915	83.455	87.138
LPC	indicBERT	86.298	85.222	86.688	81.697	84.249
LPC	MuRIL	87.157	86.582	84.585	83.215	86.603

Table 3: Results for all the models trained on SHC, LDC, and LPC datasets in percentage (%)

MahaBERT model trained on	MahaBERT model tested on	Testing Accuracy	F1 Score (Macro)	Recall (Macro)	Precision (Macro)
SHC	SHC	91.163	90.230	89.700	91.047
LPC	SHC	73.234	73.001	75.669	77.353
LDC	SHC	74.171	79.570	76.599	79.570
SHC+LPC+LDC	SHC	86.780	85.484	86.195	87.689
SHC	LPC	73.234	73.001	75.669	77.353
LPC	LPC	86.731	84.915	83.455	87.138
LDC	LPC	72.201	75.741	72.530	70.521
SHC+LPC+LDC	LPC	89.713	88.421	88.545	88.439
SHC	LDC	80.314	79.042	84.109	81.521
LPC	LDC	87.294	86.511	88.559	86.424
LDC	LDC	94.706	93.589	93.210	94.079
SHC+LPC+LDC	LDC	87.758	86.686	87.869	91.918

Table 4: Results for the MahaBERT models trained on SHC, LDC, and LPC datasets tested on the test set of other datasets in percentage (%)

4 Evaluation

We fine-tune the monolingual and multilingual BERT models supporting the Marathi language on the curated L3Cube-MahaNews corpus for the text classification task. A dense layer is added on top of the BERT model which maps the [CLS] token embedding to the 12 target labels.
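
The classification head described above amounts to a single linear map followed by a softmax. The sketch below shows this computation in plain Python. The hidden size of 768, the random weights, and the random stand-in for the [CLS] embedding are assumptions for illustration only; in practice the embedding comes from the BERT encoder and the weights are learned during fine-tuning.

```python
import math
import random

NUM_LABELS = 12     # the 12 MahaNews categories
HIDDEN_SIZE = 768   # assumption: hidden size of a BERT-base encoder


def dense_head(cls_embedding, weights, bias):
    """Map one [CLS] embedding to a probability distribution over the labels."""
    logits = [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, bias)
    ]
    # Numerically stable softmax: subtract the largest logit first.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical, randomly initialized head parameters.
weights = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS
# Stand-in for the encoder's [CLS] output for one input text.
cls_embedding = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]

probs = dense_head(cls_embedding, weights, bias)
predicted_label_id = max(range(NUM_LABELS), key=probs.__getitem__)
```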

4.1 Experiment setup

4.1.1 Data Preparation

Each of the SHC, LDC, and LPC corpora is split into train, test, and validation datasets in a ratio of 80:10:10. We have ensured that the category-wise distribution of data in SHC, LDC, and LPC remains constant across the split datasets.

The datasets are preprocessed to remove unwanted characters and words such as newline characters, hashtags, URLs, and so on. After preprocessing, only Devanagari characters, English characters, and numerical digits are retained.
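
A minimal sketch of such a preprocessing function is shown below. The regular expressions are assumptions, not the authors' exact rules: URLs and whole hashtags are dropped, and everything outside the Devanagari Unicode block (U+0900 to U+097F), ASCII letters, ASCII digits, and spaces is replaced by a space.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_PATTERN = re.compile(r"#\S+")
# Keep Devanagari (U+0900 to U+097F), ASCII letters, ASCII digits, and spaces.
DISALLOWED_PATTERN = re.compile(r"[^\u0900-\u097FA-Za-z0-9 ]")


def preprocess(text: str) -> str:
    text = URL_PATTERN.sub(" ", text)        # drop URLs
    text = HASHTAG_PATTERN.sub(" ", text)    # drop hashtags (assumed: the whole tag)
    text = text.replace("\n", " ")           # drop newline characters
    text = DISALLOWED_PATTERN.sub(" ", text) # drop all other unwanted characters
    return " ".join(text.split())            # collapse repeated spaces


cleaned_text = preprocess("नवीन फोन\nलाँच! #tech https://example.com किंमत 9999 रुपये")
```

URLs and hashtags are removed before the character filter so that their ASCII letters do not survive as stray words.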

Refer to Table 2 for the category-wise distribution of data into train, test, and validation datasets.
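
The stratified 80:10:10 split can be sketched as below. The authors do not state their exact procedure; the rounding rule used here (floor for the training share, with the remainder split so that the test set receives the extra record) is an assumption that happens to reproduce the per-category counts in Table 2, for example 1664/209/208 for the 2081 Auto records. The names `split_sizes` and `stratified_split` are illustrative.

```python
import random
from collections import defaultdict


def split_sizes(n: int):
    """Return (train, test, validation) sizes for n records of one category."""
    n_train = n * 8 // 10        # floor of 80%
    rest = n - n_train
    n_test = (rest + 1) // 2     # test gets the extra record when rest is odd
    return n_train, n_test, rest - n_test


def stratified_split(records, seed=42):
    """Split records per label so every split keeps the label proportions."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)
    train, test, validation = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train, n_test, _ = split_sizes(len(group))
        train.extend(group[:n_train])
        test.extend(group[n_train:n_train + n_test])
        validation.extend(group[n_train + n_test:])
    return train, test, validation


records = [
    {"text": f"doc {i}", "label": label}
    for label, count in (("Auto", 50), ("Tech", 30))
    for i in range(count)
]
train, test, validation = stratified_split(records)
```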

4.1.2 Models

The pre-trained BERT models that have been finetuned for text classification are as follows:

• MahaBERT (https://huggingface.co/l3cube-pune/marathi-bert-v2): MahaBERT is a multilingual BERT model fine-tuned on L3Cube-MahaCorpus and other publicly available Marathi monolingual datasets, totaling 752 million tokens.
• indicBERT (https://huggingface.co/ai4bharat/indic-bert): IndicBERT is a multilingual ALBERT model pre-trained exclusively on 12 Indian languages, using the AI4Bharat IndicNLP corpora of around 9 billion tokens.
• MuRIL (https://huggingface.co/google/muril-base-cased): MuRIL is a BERT model pre-trained on 17 Indian languages, using datasets from Wikipedia, Common Crawl, Dakshina, etc.

These models gave the best results when trained for 3 epochs on the training datasets at the default learning rate (1e-3). The exception was MuRIL on the SHC dataset, which performed best when trained for 5 epochs.

The fine-tuned MahaBERT models were also evaluated on the test sets of the other datasets. For example, the MahaBERT model fine-tuned on the SHC dataset was tested on the LDC and LPC test sets to obtain the results of this cross-analysis.

4.2 Results

The results obtained from fine-tuning the models on our datasets are shown in Table 3, along with the confusion matrices in Figures 5, 6, and 7. The results of the cross-analysis, obtained by testing the MahaBERT models on the test sets of different datasets, are given in Table 4.
The key observations are as follows:

• The monolingual MahaBERT model outperforms all other models in terms of the various scores reported in Table 3 for every corpus.
• Among SHC, LDC, and LPC, LDC gave the best results in fine-tuning for the text classification task. This is expected, as the long document data contains more information than the other two smaller-length datasets.
• LPC reports scores on the lower side for all 3 models. A paragraph might at times contain more generic information and hence confuse the models.
• Cross-dataset (zero-shot) testing on unseen datasets reveals that models trained on one dataset do not generalize well to the other test sets. This affirms the need for different datasets with varying text lengths.
• The model trained on all three datasets (SHC + LPC + LDC) provides the best results for LPC but fares poorly for SHC and LDC. This shows that building a single competitive model needs more attention. The larger number of samples in the LPC dataset compared to the other datasets could explain the bias towards LPC. An extensive evaluation of this behavior is left to future work.
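
For reference, the macro-averaged scores reported in Tables 3 and 4 can be computed as in the sketch below. It assumes the common definition in which precision, recall, and F1 are computed per class and then averaged with equal weight per class; the paper does not spell out the exact implementation, and the function name `macro_scores` and the toy labels are illustrative.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


y_true = ["Sports", "Sports", "Tech", "Tech", "Auto", "Auto"]
y_pred = ["Sports", "Tech", "Tech", "Tech", "Auto", "Sports"]
scores = macro_scores(y_true, y_pred)
```

Macro averaging weights each category equally, so small categories such as Education and Travel influence the score as much as large ones such as Manoranjan.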

5 Conclusion

In this paper, we present L3Cube-MahaNews, a suite of 3 labeled datasets comprising 1.08L+ Marathi records for Marathi text classification. The paper describes the 12 categorical labels used to create the supervised datasets. We fine-tuned models supporting Marathi, namely MahaBERT, IndicBERT, and MuRIL, to provide a benchmark for future studies and development. We report the best accuracy using MahaBERT on the LDC dataset. We hope that our datasets will play an important role in improving Marathi language support in the field of NLP.

Limitations

During data scraping and preparation, it was seen that some news articles had scanned images, GIFs, banner ads, etc., as a part of the web page content. Thus, additional tools (e.g. an OCR image-to-text converter) might be required to extract text from such web content and retain only the news-related textual data in a proper format. Moreover, since the LPC dataset was created by extracting random paragraphs from the parent articles, these might at times contain generic information not specific to the target label. In the future, we can manually verify the dataset to filter out such problematic entries.

Acknowledgments

This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement. This work is a part of the L3Cube-MahaNLP project Joshi (2022b).

References

Arora (2020) Gaurav Arora. 2020. inltk: Natural language toolkit for indic languages. arXiv preprint arXiv:2009.12534.
Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Eranpurwala et al. (2022) Femida Eranpurwala, Priyanka Ramane, and Bharath Kumar Bolla. 2022. Comparative study of marathi text classification using monolingual and multilingual embeddings. In Advanced Network Technologies and Intelligent Computing: First International Conference, ANTIC 2021, Varanasi, India, December 17–18, 2021, Proceedings, pages 441–452. Springer.
Jain et al. (2020) Kushal Jain, Adwait Deshpande, Kumar Shridhar, Felix Laumann, and Ayushman Dash. 2020. Indic-transformers: An analysis of transformer language models for indian languages. arXiv preprint arXiv:2011.02323.
Joshi (2022a) Raviraj Joshi. 2022a. L3cube-mahacorpus and mahabert: Marathi monolingual corpus, marathi bert language models, and resources. In LREC 2022 Workshop Language Resources and Evaluation Conference 20-25 June 2022, page 97.
Joshi (2022b) Raviraj Joshi. 2022b. L3cube-mahanlp: Marathi natural language processing datasets, models, and library. arXiv preprint arXiv:2205.14728.
Kakwani et al. (2020) Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, NC Gokul, Avik Bhattacharyya, Mitesh M Khapra, and Pratyush Kumar. 2020. Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961.
Khanuja et al. (2021) Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, et al. 2021. Muril: Multilingual representations for indian languages. arXiv preprint arXiv:2103.10730.
Kulkarni et al. (2022) Atharva Kulkarni, Meet Mandhane, Manali Likhitkar, Gayatri Kshirsagar, Jayashree Jagdale, and Raviraj Joshi. 2022. Experimental evaluation of deep learning models for marathi text classification. In Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2021, pages 605–613. Springer.
Kulkarni et al. (2021) Atharva Kulkarni, Meet Mandhane, Manali Likhitkar, Gayatri Kshirsagar, and Raviraj Joshi. 2021. L3cubemahasent: A marathi tweet-based sentiment analysis dataset. In Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 213–220.
Minaee et al. (2021) Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. Deep learning–based text classification: a comprehensive review. ACM computing surveys (CSUR), 54(3):1–40.
Patil et al. (2022) Hrushikesh Patil, Abhishek Velankar, and Raviraj Joshi. 2022. L3cube-mahahate: A tweet-based marathi hate speech detection dataset and bert models. In Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), pages 1–9.
Velankar et al. (2022) Abhishek Velankar, Hrushikesh Patil, and Raviraj Joshi. 2022. Mono vs multilingual bert for hate speech detection and text classification: A case study in marathi. In Artificial Neural Networks in Pattern Recognition: 10th IAPR TC3 Workshop, ANNPR 2022, Dubai, United Arab Emirates, November 24–26, 2022, Proceedings, pages 121–128. Springer.
Wagh et al. (2021) Vedangi Wagh, Snehal Khandve, Isha Joshi, Apurva Wani, Geetanjali Kale, and Raviraj Joshi. 2021. Comparative study of long document classification. In TENCON 2021-2021 IEEE Region 10 Conference (TENCON), pages 732–737. IEEE.