Showing 1–2 of 2 results for author: Tumrani, S

Search v0.5.6 released 2020-02-24

arXiv:2408.15720 [pdf, other]

cs.CL

An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks

Authors: Wazir Ali, Saifullah Tumrani, Jay Kumar, Tariq Rahim Soomro

Abstract: In this paper, we propose a new word embedding based corpus consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline for the filtration of unwanted text from crawled data. Afterwards, the cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. For the evaluation of pretrained embe… ▽ More In this paper, we propose a new word embedding based corpus consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline for the filtration of unwanted text from crawled data. Afterwards, the cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. For the evaluation of pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The evaluation results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and existing Sindhi fastText word embedding on both intrinsic and extrinsic evaluation approaches △ Less

Submitted 28 August, 2024; originally announced August 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:1911.12579
arXiv:2012.15079 [pdf, other]

cs.CL cs.LG

Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention

Authors: Wazir Ali, Jay Kumar, Saifullah Tumrani, Redhwan Nour, Adeeb Noor, Zenglin Xu

Abstract: Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It's cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitation… ▽ More Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It's cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets. △ Less

Submitted 4 September, 2024; v1 submitted 30 December, 2020; originally announced December 2020.

Comments: Journal Paper, 14 pages

Search v0.5.6 released 2020-02-24