Effect of stemming on text similarity for Arabic language at sentence level

PeerJ Comput Sci. 2021 May 14:7:e530. doi: 10.7717/peerj-cs.530. eCollection 2021.

Abstract

Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. A total of 10 algorithms are used in this study, comprising several Arabic light stemmers, heavy stemmers and lemmatization algorithms. The standard training and testing data sets from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar-ar), are used. Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayes (NB). Compared to the original text, using the stemmed and lemmatized documents in the experiments achieves higher Pearson correlation results. The best results were attained with the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, two stemmers, the Khoja heavy stemmer and the AlKhalil light stemmer, produce worse results than the original text.
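To illustrate why stemming can change a sentence-level similarity score, the following is a minimal sketch, not the study's pipeline. It assumes a toy light stemmer with a hypothetical affix list (far simpler than ARLSTem or Farasa), swaps the machine learning models for a plain bag-of-words cosine similarity, and includes a hand-written Pearson correlation of the kind used to compare predicted scores against gold scores. All function names and affix lists here are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical affix lists for illustration only; real Arabic light
# stemmers apply far richer rules than this.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def toy_light_stem(token):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sentence_similarity(s1, s2, stem=True):
    """Whitespace-tokenize, optionally stem, then compare with cosine."""
    t1, t2 = s1.split(), s2.split()
    if stem:
        t1 = [toy_light_stem(t) for t in t1]
        t2 = [toy_light_stem(t) for t in t2]
    return cosine_similarity(t1, t2)


def pearson(xs, ys):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


if __name__ == "__main__":
    s1 = "الولد يقرأ الكتاب"   # "the boy reads the book"
    s2 = "ولد يقرأ كتاب"       # "a boy reads a book"
    print("without stemming:", sentence_similarity(s1, s2, stem=False))
    print("with stemming:   ", sentence_similarity(s1, s2, stem=True))
```

In this toy pair, removing the definite article lets the two sentences share all of their tokens, so the stemmed score is higher than the unstemmed one. Over-aggressive stripping can also merge unrelated words, which is one plausible reason a stemmer might lower correlation with gold scores.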

Keywords: Lemmatization; Machine learning; Natural language processing; Semantic text similarity; Stemming; TF-IDF; Word embedding.

Grants and funding

This project was supported by the Deanship of Scientific Research at Prince Sattam bin Abdulaziz University under the research project No. 2019/01/9840. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.