-
Mapping Violence: Developing an Extensive Framework to Build a Bangla Sectarian Expression Dataset from Social Media Interactions
Authors:
Nazia Tasnim,
Sujan Sen Gupta,
Md. Istiak Hossain Shihab,
Fatiha Islam Juee,
Arunima Tahsin,
Pritom Ghum,
Kanij Fatema,
Marshia Haque,
Wasema Farzana,
Prionti Nasir,
Ashique KhudaBukhsh,
Farig Sadeque,
Asif Sushmit
Abstract:
Communal violence in online forums has become extremely prevalent in South Asia, where many communities of different cultures coexist and share resources. These societies exhibit a phenomenon characterized by strong bonds within their own groups and animosity towards others, leading to conflicts that frequently escalate into violent confrontations. To address this issue, we have developed the firs…
▽ More
Communal violence in online forums has become extremely prevalent in South Asia, where many communities of different cultures coexist and share resources. These societies exhibit a phenomenon characterized by strong bonds within their own groups and animosity towards others, leading to conflicts that frequently escalate into violent confrontations. To address this issue, we have developed the first comprehensive framework for the automatic detection of communal violence markers in online Bangla content accompanying the largest collection (13K raw sentences) of social media interactions that fall under the definition of four major violence class and their 16 coarse expressions. Our workflow introduces a 7-step expert annotation process incorporating insights from social scientists, linguists, and psychologists. By presenting data statistics and benchmarking performance using this dataset, we have determined that, aside from the category of Non-communal violence, Religio-communal violence is particularly pervasive in Bangla text. Moreover, we have substantiated the effectiveness of fine-tuning language models in identifying violent comments by conducting preliminary benchmarking on the state-of-the-art Bangla deep learning model.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
IPA Transcription of Bengali Texts
Authors:
Kanij Fatema,
Fazle Dawood Haider,
Nirzona Ferdousi Turpa,
Tanveer Azmal,
Sourav Ahmed,
Navid Hasan,
Mohammad Akhlaqur Rahman,
Biplab Kumar Sarkar,
Afrar Jahin,
Md. Rezuwan Hassan,
Md Foriduzzaman Zihad,
Rubayet Sabbir Faruque,
Asif Sushmit,
Mashrur Imtiaz,
Farig Sadeque,
Syed Shahrier Rahman
Abstract:
The International Phonetic Alphabet (IPA) serves to systematize phonemes in language, enabling precise textual representation of pronunciation. In Bengali phonology and phonetics, ongoing scholarly deliberations persist concerning the IPA standard and core Bengali phonemes. This work examines prior research, identifies current and potential issues, and suggests a framework for a Bengali IPA standa…
▽ More
The International Phonetic Alphabet (IPA) serves to systematize phonemes in language, enabling precise textual representation of pronunciation. In Bengali phonology and phonetics, ongoing scholarly deliberations persist concerning the IPA standard and core Bengali phonemes. This work examines prior research, identifies current and potential issues, and suggests a framework for a Bengali IPA standard, facilitating linguistic analysis and NLP resource creation and downstream technology development. In this work, we present a comprehensive study of Bengali IPA transcription and introduce a novel IPA transcription framework incorporating a novel dataset with DL-based benchmarks.
△ Less
Submitted 29 March, 2024;
originally announced March 2024.
-
Unicode Normalization and Grapheme Parsing of Indic Languages
Authors:
Nazmuddoha Ansary,
Quazi Adibur Rahman Adib,
Tahsin Reasat,
Asif Shahriyar Sushmit,
Ahmed Imtiaz Humayun,
Sazia Mehnaz,
Kanij Fatema,
Mohammad Mamun Or Rashid,
Farig Sadeque
Abstract:
Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units. A prominent feature of these languages is these complex grapheme units that comprise consonants/consonant conjuncts, vowel diacritics, and consonant diacritics, which, together make a unique Language. Unicode-based writing schemes of these languages often disregard this feat…
▽ More
Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units. A prominent feature of these languages is these complex grapheme units that comprise consonants/consonant conjuncts, vowel diacritics, and consonant diacritics, which, together make a unique Language. Unicode-based writing schemes of these languages often disregard this feature of these languages and encode words as linear sequences of Unicode characters using an intricate scheme of connector characters and font interpreters. Due to this way of using a few dozen Unicode glyphs to write thousands of different unique glyphs (complex graphemes), there are serious ambiguities that lead to malformed words. In this paper, we are proposing two libraries: i) a normalizer for normalizing inconsistencies caused by a Unicode-based encoding scheme for Indic languages and ii) a grapheme parser for Abugida text. It deconstructs words into visually distinct orthographic syllables or complex graphemes and their constituents. Our proposed normalizer is a more efficient and effective tool than the previously used IndicNLP normalizer. Moreover, our parser and normalizer are also suitable tools for general Abugida text processing as they performed well in our robust word-based and NLP experiments. We report the pipeline for the scripts of 7 languages in this work and develop the framework for the integration of more scripts.
△ Less
Submitted 27 May, 2024; v1 submitted 11 May, 2023;
originally announced June 2023.
-
Implementation of the Trigonometric LMS Algorithm using Original Cordic Rotation
Authors:
Nasrin Akhter,
Kaniz Fatema,
Lilatul Ferdouse,
Faria Khandaker
Abstract:
The LMS algorithm is one of the most successful adaptive filtering algorithms. It uses the instantaneous value of the square of the error signal as an estimate of the mean-square error (MSE). The LMS algorithm changes (adapts) the filter tap weights so that the error signal is minimized in the mean square sense. In Trigonometric LMS (TLMS) and Hyperbolic LMS (HLMS), two new versions of LMS algorit…
▽ More
The LMS algorithm is one of the most successful adaptive filtering algorithms. It uses the instantaneous value of the square of the error signal as an estimate of the mean-square error (MSE). The LMS algorithm changes (adapts) the filter tap weights so that the error signal is minimized in the mean square sense. In Trigonometric LMS (TLMS) and Hyperbolic LMS (HLMS), two new versions of LMS algorithms, same formulations are performed as in the LMS algorithm with the exception that filter tap weights are now expressed using trigonometric and hyperbolic formulations, in cases for TLMS and HLMS respectively. Hence appears the CORDIC algorithm as it can efficiently perform trigonometric, hyperbolic, linear and logarithmic functions. While hardware-efficient algorithms often exist, the dominance of the software systems has kept those algorithms out of the spotlight. Among these hardware- efficient algorithms, CORDIC is an iterative solution for trigonometric and other transcendental functions. Former researches worked on CORDIC algorithm to observe the convergence behavior of Trigonometric LMS (TLMS) algorithm and obtained a satisfactory result in the context of convergence performance of TLMS algorithm. But revious researches directly used the CORDIC block output in their simulation ignoring the internal step-by-step rotations of the CORDIC processor. This gives rise to a need for verification of the convergence performance of the TLMS algorithm to investigate if it actually performs satisfactorily if implemented with step-by-step CORDIC rotation. This research work has done this job. It focuses on the internal operations of the CORDIC hardware, implements the Trigonometric LMS (TLMS) and Hyperbolic LMS (HLMS) algorithms using actual CORDIC rotations. The obtained simulation results are highly satisfactory and also it shows that convergence behavior of HLMS is much better than TLMS.
△ Less
Submitted 21 April, 2011; v1 submitted 18 August, 2010;
originally announced August 2010.