
M2DS: Multilingual Dataset for Multi-document Summarisation

Authors: Kushan Hewapathirana, Nisansa de Silva, C. D. Athuraliya

Abstract: In the rapidly evolving digital era, there is an increasing demand for concise information as individuals seek to distil key insights from various sources. Recent attention from researchers on Multi-document Summarisation (MDS) has resulted in diverse datasets covering customer reviews, academic papers, medical and legal documents, and news articles. However, the English-centric nature of these datasets has created a conspicuous void for multilingual datasets in today's globalised digital landscape, where linguistic diversity is celebrated. Media platforms such as British Broadcasting Corporation (BBC) have disseminated news in 20+ languages for decades. With only 380 million people speaking English natively as their first language, accounting for less than 5% of the global population, the vast majority primarily relies on other languages. These facts underscore the need for inclusivity in MDS research, utilising resources from diverse languages. Recognising this gap, we present the Multilingual Dataset for Multi-document Summarisation (M2DS), which, to the best of our knowledge, is the first dataset of its kind. It includes document-summary pairs in five languages from BBC articles published during the 2010-2023 period. This paper introduces M2DS, emphasising its unique multilingual aspect, and includes baseline scores from state-of-the-art MDS models evaluated on our dataset.

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2407.02834 [pdf, ps, other]

Aspect-Based Sentiment Analysis Techniques: A Comparative Study

Authors: Dineth Jayakody, Koshila Isuranda, A V A Malkith, Nisansa de Silva, Sachintha Rajith Ponnamperuma, G G N Sandamali, K L K Sudheera

Abstract: Since the dawn of the digitalisation era, customer feedback and online reviews have unequivocally been major sources of insights for businesses. Consequently, conducting comparative analyses of such sources has become the de facto modus operandi of any business that wishes to give itself a competitive edge over its peers and improve customer loyalty. Sentiment analysis is one such method instrumental in gauging public interest, exposing market trends, and analysing competitors. While traditional sentiment analysis focuses on overall sentiment, as the needs advance with time, it has become important to explore public opinions and sentiments on various specific subjects, products and services mentioned in the reviews at a finer-grained level. To this end, Aspect-based Sentiment Analysis (ABSA), supported by advances in Artificial Intelligence (AI) techniques which have contributed to a paradigm shift from simple word-level analysis to tone and context-aware analyses, focuses on identifying specific aspects within the text and determining the sentiment associated with each aspect. In this study, we compare several deep-NN methods for ABSA on two benchmark datasets (Restaurant14 and Laptop-14) and find that FAST LSA obtains the best overall results of 87.6% and 82.6% accuracy but does not pass LSA+DeBERTa which reports 90.33% and 86.21% accuracy respectively.

Submitted 4 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.00407 [pdf, other]

doi 10.1109/ICIIS58898.2023.10253606

SHADE: Semantic Hypernym Annotator for Domain-specific Entities -- DnD Domain Use Case

Authors: Akila Peiris, Nisansa de Silva

Abstract: Manual data annotation is an important NLP task but one that takes a considerable amount of resources and effort. In spite of the costs, labeling and categorizing entities is essential for NLP tasks such as semantic evaluation. Even though annotation can be done by non-experts in most cases, due to the fact that this requires human labor, the process is costly. Another major challenge encountered in data annotation is maintaining annotation consistency. Annotation efforts are typically carried out by teams of multiple annotators. The annotations need to maintain consistency in relation to both the domain truth and annotation format while reducing human errors. Annotating a specialized domain that deviates significantly from the general domain, such as fantasy literature, will see a lot of human error and annotator disagreement. So it is vital that proper guidelines and error reduction mechanisms are enforced. One such way to enforce these constraints is using a specialized application. Such an app can ensure that the notations are consistent, and the labels can be pre-defined or restricted, reducing the room for errors. In this paper, we present SHADE, an annotation software that can be used to annotate entities in the high fantasy literature domain, specifically in Dungeons and Dragons lore extracted from the Forgotten Realms Fandom Wiki.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.06021 [pdf, other]

Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research

Authors: Surangika Ranathunga, Nisansa de Silva, Dilith Jayakody, Aloka Fernando

Abstract: We analysed a sample of NLP research papers archived in ACL Anthology as an attempt to quantify the degree of openness and the benefit of such an open culture in the NLP community. We observe that papers published in different NLP venues show different patterns related to artefact reuse. We also note that more than 30% of the papers we analysed do not release their artefacts publicly, despite promising to do so. Further, we observe a wide language-wise disparity in publicly available NLP-related artefacts.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Will appear in ACL 2024

arXiv:2402.10662 [pdf, other]

doi 10.1109/MERCon60487.2023.10355501

Fine Tuning Named Entity Extraction Models for the Fantasy Domain

Authors: Aravinth Sivaganeshan, Nisansa de Silva

Abstract: Named Entity Recognition (NER) is a sequence classification Natural Language Processing task where entities are identified in the text and classified into predefined categories. It acts as a foundation for most information extraction systems. Dungeons and Dragons (D&D) is an open-ended tabletop fantasy game with its own diverse lore. D&D entities are domain-specific and are thus unrecognizable by even the state-of-the-art off-the-shelf NER systems, as the NER systems are trained on general data for pre-defined categories such as person (PERS), location (LOC), organization (ORG), and miscellaneous (MISC). For meaningful extraction of information from fantasy text, the entities need to be classified into domain-specific entity categories and the models need to be fine-tuned on a domain-relevant corpus. This work uses available lore of monsters in the D&D domain to fine-tune Trankit, which is a prolific NER framework that uses a pre-trained model for NER. Upon this training, the system acquires the ability to extract monster names from relevant domain documents under a novel NER tag. This work compares the accuracy of the monster name identification against the zero-shot Trankit model and two FLAIR models. The fine-tuned Trankit model achieves an 87.86% F1 score, surpassing all the other considered models.

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2402.07446 [pdf, other]

Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora

Authors: Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, Charitha Rathnayake

Abstract: We conducted a detailed analysis on the quality of web-mined corpora for two low-resource languages (making three language pairs, English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out an intrinsic and extrinsic evaluation on different portions of this ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.

Submitted 14 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2401.07356 [pdf, other]

BUGSPHP: A dataset for Automated Program Repair in PHP

Authors: K. D. Pramod, W. T. N. De Silva, W. U. K. Thabrew, Ridwan Shariffdeen, Sandareka Wickramanayake

Abstract: Automated Program Repair (APR) improves developer productivity by saving debugging and bug-fixing time. While APR has been extensively explored for C/C++ and Java programs, there is little research on bugs in PHP programs due to the lack of a benchmark PHP bug dataset. This is surprising given that PHP has been one of the most widely used server-side languages for over two decades, being used in a variety of contexts such as e-commerce, social networking, and content management. This paper presents a benchmark dataset of PHP bugs on real-world applications called BUGSPHP, which can enable research on analysis, testing, and repair for PHP programs. The dataset consists of training and test datasets, separately curated from GitHub and processed locally. The training dataset includes more than 600,000 bug-fixing commits. The test dataset contains 513 manually validated bug-fixing commits equipped with developer-provided test cases to assess patch correctness.

Submitted 21 January, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

arXiv:2311.10436 [pdf, other]

Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language

Authors: Kasun Wickramasinghe, Nisansa de Silva

Abstract: Since their inception, embeddings have become a primary ingredient in many flavours of Natural Language Processing (NLP) tasks, supplanting earlier types of representation. Even though multilingual embeddings have been used for the increasing number of multilingual tasks, due to the scarcity of parallel training data, low-resource languages such as Sinhala tend to focus more on monolingual embeddings. Then, when it comes to the aforementioned multilingual tasks, it is challenging to utilize these monolingual embeddings given that even if the embedding spaces have a similar geometric arrangement due to an identical training process, the embeddings of the languages considered are not aligned. This is solved by the embedding alignment task. Even in this, high-resource language pairs are in the limelight while low-resource languages such as Sinhala, which is in dire need of help, seem to have fallen by the wayside. In this paper, we try to align Sinhala and English word embedding spaces based on available alignment techniques and introduce a benchmark for Sinhala language embedding alignment. In addition to that, to facilitate the supervised alignment, as an intermediate task, we also introduce Sinhala-English alignment datasets. These datasets serve as our anchor datasets for supervised word embedding alignment. Even though we do not obtain results comparable to the high-resource languages such as French, German, or Chinese, we believe our work lays the groundwork for more specialized alignment between English and Sinhala embeddings.

Submitted 17 November, 2023; originally announced November 2023.

Journal ref: https://aclanthology.org/2023.paclic-1.42/

arXiv:2311.10357 [pdf, ps, other]

Fast algorithms for classical specifications of stabiliser states and Clifford gates

Authors: Nadish de Silva, Wilfred Salmon, Ming Yin

Abstract: The stabiliser formalism plays a central role in quantum computing, error correction, and fault-tolerance. Stabiliser states are used to encode computational basis states. Clifford gates are those which can be easily performed fault-tolerantly in the most common error correction schemes. Their mathematical properties are the subject of significant research interest. Conversions between and verifications of different specifications of stabiliser states and Clifford gates are important components of many classical algorithms in quantum information, e.g. for gate synthesis, circuit optimisation, and for simulating quantum circuits. These core functions are also used in the numerical experiments critical to formulating and testing mathematical conjectures on the stabiliser formalism. We develop novel mathematical insights concerning stabiliser states and Clifford gates that significantly clarify their descriptions. We then utilise these to provide ten new fast algorithms which offer asymptotic advantages over any existing implementations. We show how to rapidly verify that a vector is a stabiliser state, and interconvert between its specification as amplitudes, a quadratic form, and a check matrix. These methods are leveraged to rapidly check if a given unitary matrix is a Clifford gate and to interconvert between the matrix of a Clifford gate and its compact specification as a stabiliser tableau. For example, we extract the stabiliser tableau of a Clifford gate matrix with $N^2$ entries in $O(N \log N)$ time. Remarkably, it is not necessary to read all the elements of a Clifford matrix to extract its stabiliser tableau. This is an asymptotic speedup over the best-known method that is superexponential in the number of qubits. We provide example implementations of our algorithms in Python.

Submitted 26 May, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

Comments: Python implementations available at https://github.com/ndesilva/stabiliser-tools. New in v2: new algorithm for extracting the stabiliser tableau of a Clifford gate matrix that is exponentially faster compared to v1, more thorough complexity analyses. New in v3: new and faster algorithms, comparisons with existing implementations

arXiv:2310.08083 [pdf, other]

On Using GUI Interaction Data to Improve Text Retrieval-based Bug Localization

Authors: Junayed Mahmud, Nadeeshan De Silva, Safwat Ali Khan, Seyed Hooman Mostafavi, SM Hasan Mansur, Oscar Chaparro, Andrian Marcus, Kevin Moran

Abstract: One of the most important tasks related to managing bug reports is localizing the fault so that a fix can be applied. As such, prior work has aimed to automate this task of bug localization by formulating it as an information retrieval problem, where potentially buggy files are retrieved and ranked according to their textual similarity with a given bug report. However, there is often a notable semantic gap between the information contained in bug reports and identifiers or natural language contained within source code files. For user-facing software, there is currently a key source of information that could aid in bug localization, but has not been thoroughly investigated - information from the GUI. We investigate the hypothesis that, for end user-facing applications, connecting information in a bug report with information from the GUI, and using this to aid in retrieving potentially buggy files, can improve upon existing techniques for bug localization. To examine this phenomenon, we conduct a comprehensive empirical study that augments four baseline techniques for bug localization with GUI interaction information from a reproduction scenario to (i) filter out potentially irrelevant files, (ii) boost potentially relevant files, and (iii) reformulate text-retrieval queries. To carry out our study, we source the current largest dataset of fully-localized and reproducible real bugs for Android apps, with corresponding bug reports, consisting of 80 bug reports from 39 popular open-source apps. Our results illustrate that augmenting traditional techniques with GUI information leads to a marked increase in effectiveness across multiple metrics, including a relative increase in Hits@10 of 13-18%. Additionally, through further analysis, we find that our studied augmentations largely complement existing techniques.

Submitted 12 October, 2023; originally announced October 2023.

Comments: 13 pages, to appear in the Proceedings of the 46th International Conference on Software Engineering (ICSE'24)
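
The Hits@10 figure quoted in the abstract above can be illustrated with a minimal sketch. It assumes the common definition of Hits@K (a bug report counts as a hit when at least one truly buggy file appears in the top K retrieved files); the paper may define the metric differently. The function names and the toy file names are hypothetical and are not taken from the paper.

```python
def hits_at_k(ranked_files, buggy_files, k=10):
    """Return True if any truly buggy file is among the top-k ranked files."""
    return any(f in buggy_files for f in ranked_files[:k])


def hits_at_k_rate(results, k=10):
    """Fraction of bug reports with a hit in the top k.

    `results` is a list of (ranked_files, buggy_files) pairs, one per bug report.
    """
    if not results:
        return 0.0
    return sum(hits_at_k(ranked, buggy, k) for ranked, buggy in results) / len(results)


def relative_increase(baseline, augmented):
    """Relative improvement of an augmented score over a non-zero baseline score."""
    return (augmented - baseline) / baseline


# Toy data (hypothetical): two bug reports with their ranked retrieval results.
toy_results = [
    (["A.java", "B.java", "C.java"], {"C.java"}),  # buggy file ranked third
    (["A.java", "B.java"], {"Z.java"}),            # buggy file never retrieved
]
print(hits_at_k_rate(toy_results, k=10))
```

A relative increase is measured against the baseline score, so a 13-18% relative gain is not the same as a 13-18 percentage-point gain.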

arXiv:2309.17171 [pdf, other]

Comparative Analysis of Named Entity Recognition in the Dungeons and Dragons Domain

Authors: Gayashan Weerasundara, Nisansa de Silva

Abstract: Many NLP tasks, although well-resolved for general English, face challenges in specific domains like fantasy literature. This is evident in Named Entity Recognition (NER), which detects and categorizes entities in text. We analyzed 10 NER models on 7 Dungeons and Dragons (D&D) adventure books to assess domain-specific performance. Using open-source Large Language Models, we annotated named entities in these books and evaluated each model's precision. Our findings indicate that, without modifications, Flair, Trankit, and Spacy outperform others in identifying named entities in the D&D context.

Submitted 29 September, 2023; originally announced September 2023.

Comments: 9 pages

arXiv:2309.04951 [pdf, ps, other]

doi 10.1109/ICIIS58898.2023.10253581

Multi-document Summarization: A Comparative Evaluation

Authors: Kushan Hewapathirana, Nisansa de Silva, C. D. Athuraliya

Abstract: This paper is aimed at evaluating state-of-the-art models for Multi-document Summarization (MDS) on different types of datasets in various domains and investigating the limitations of existing models to determine future research directions. To address this gap, we conducted an extensive literature review to identify state-of-the-art models and datasets. We analyzed the performance of PRIMERA and PEGASUS models on BigSurvey-MDS and MS$^2$ datasets, which posed unique challenges due to their varied domains. Our findings show that the General-Purpose Pre-trained Model LED outperforms PRIMERA and PEGASUS on the MS$^2$ dataset. We used the ROUGE score as a performance metric to evaluate the identified models on different datasets. Our study provides valuable insights into the models' strengths and weaknesses, as well as their applicability in different domains. This work serves as a reference for future MDS research and contributes to the development of accurate and robust models which can be utilized on demanding datasets with academically and/or scientifically complex data as well as generalized, relatively simple datasets.

Submitted 12 September, 2023; v1 submitted 10 September, 2023; originally announced September 2023.
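
Since the abstract above relies on ROUGE as its metric, the following is a simplified sketch of ROUGE-1 (unigram overlap with clipped counts). It assumes plain whitespace tokenisation and lowercasing, with no stemming or stop-word handling, so its scores will not match those of the official ROUGE toolkit or of the paper. The function name `rouge_1` is hypothetical.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) of unigram overlap between two texts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (hypothetical): a short candidate summary against a longer reference.
print(rouge_1("the cat sat", "the cat sat on the mat"))
```

Recall rewards covering the reference and precision penalises padding, which is why summarisation work usually reports the F-measure.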

arXiv:2308.02234 [pdf, other]

doi 10.1109/ICIIS58898.2023.10253560

Sinhala-English Parallel Word Dictionary Dataset

Authors: Kasun Wickramasinghe, Nisansa de Silva

Abstract: Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, in the cases where one of the considered language pairs is a low-resource language, the existing top-down parallel data such as corpora are lacking in both tally and quality due to the dearth of human annotation. Therefore, for low-resource languages, it is more feasible to move in the bottom-up direction where finer granular pairs such as dictionary datasets are developed first. They may then be used for mid-level tasks such as supervised multilingual word embedding alignment. These in turn can later guide higher-level tasks in the order of aligning sentence or paragraph text corpora used for Machine Translation (MT). Even though more approachable than generating and aligning a massive corpus for a low-resource language, for the same reason of apathy from larger research entities, even these finer granular data sets are lacking for some low-resource languages. We have observed that there is no free and open dictionary data set for the low-resource language, Sinhala. Thus, in this work, we introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages. In this paper, we explain the dataset creation pipeline as well as the experimental results of the tests we have carried out to verify the quality of the data sets. The data sets and the related scripts are available at https://github.com/kasunw22/sinhala-para-dict.

Submitted 4 August, 2023; originally announced August 2023.

arXiv:2302.06050 [pdf, other]

BURT: A Chatbot for Interactive Bug Reporting

Authors: Yang Song, Junayed Mahmud, Nadeeshan De Silva, Ying Zhou, Oscar Chaparro, Kevin Moran, Andrian Marcus, Denys Poshyvanyk

Abstract: This paper introduces BURT, a web-based chatbot for interactive reporting of Android app bugs. BURT is designed to assist Android app end-users in reporting high-quality defect information using an interactive interface. BURT guides the users in reporting essential bug report elements, i.e., the observed behavior, expected behavior, and the steps to reproduce the bug. It verifies the quality of the text written by the user and provides instant feedback. In addition, BURT provides graphical suggestions that the users can choose as alternatives to textual descriptions. We empirically evaluated BURT, asking end-users to report bugs from six Android apps. The reporters found that BURT's guidance and automated suggestions and clarifications are useful and BURT is easy to use. BURT is an open-source tool, available at github.com/sea-lab-wm/burt/tree/tool-demo. A video showing the full capabilities of BURT can be found at https://youtu.be/SyfOXpHYGRo

Submitted 12 February, 2023; originally announced February 2023.

Comments: Accepted by the Demonstrations Track of the 45th International Conference on Software Engineering (ICSE'23). arXiv admin note: substantial text overlap with arXiv:2209.10062

arXiv:2212.09080 [pdf, other]

Synthesis and Evaluation of a Domain-specific Large Data Set for Dungeons & Dragons

Authors: Akila Peiris, Nisansa de Silva

Abstract: This paper introduces the Forgotten Realms Wiki (FRW) data set and domain specific natural language generation using FRW along with related analyses. Forgotten Realms is the de-facto default setting of the popular open ended tabletop fantasy role playing game, Dungeons & Dragons. The data set was extracted from the Forgotten Realms Fandom wiki, consisting of more than 45,200 articles. The FRW data set is constituted of 11 sub-data sets in a number of formats: raw plain text, plain text annotated by article title, directed link graphs, wiki info-boxes annotated by the wiki article title, Poincaré embedding of first link graph, multiple Word2Vec and Doc2Vec models of the corpus. This is the first data set of this size for the Dungeons & Dragons domain. We then present a pairwise similarity comparison benchmark which utilizes similarity measures. In addition, we perform D&D domain specific natural language generation using the corpus and evaluate the named entity classification with respect to the lore of Forgotten Realms.

Submitted 18 December, 2022; originally announced December 2022.

arXiv:2210.14472 [pdf, other]

Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages

Authors: Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, Yudhanjaya Wijeratne

Abstract: In the process of numerically modeling natural languages, developing language embeddings is a vital step. However, it is challenging to develop functional embeddings for resource-poor languages such as Sinhala, for which sufficiently large corpora, effective language parsers, and any other required resources are difficult to find. In such conditions, the exploitation of existing models to come up with an efficacious embedding methodology to numerically represent text could be quite fruitful. This paper explores the effectivity of several one-tiered and two-tiered embedding architectures in representing Sinhala text in the sentiment analysis domain. With our findings, the two-tiered embedding architecture where the lower-tier consists of a word embedding and the upper-tier consists of a sentence embedding has been proven to perform better than one-tier word embeddings, by achieving a maximum F1 score of 88.04% in contrast to the 83.76% achieved by word embedding models. Furthermore, embeddings in the hyperbolic space are also developed and compared with Euclidean embeddings in terms of performance. A sentiment data set consisting of Facebook posts and associated reactions has been used for this research. To effectively compare the performance of different embedding systems, the same deep neural network structure has been trained on sentiment data with each of the embedding systems used to encode the text associated.

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.08523 [pdf, other]

Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World

Authors: Surangika Ranathunga, Nisansa de Silva

Abstract: Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently. However, different facets of this problem, or the reasons behind this disparity are seldom discussed within the NLP community. This paper provides a comprehensive analysis of the disparity that exists within the languages of the world. We show that simply categorising languages considering data availability may not be always correct. Using an existing language categorisation based on speaker population and vitality, we analyse the distribution of language data resources, amount of NLP/CL research, inclusion in multilingual web-based platforms and the inclusion in pre-trained multilingual models. We show that many languages do not get covered in these resources or platforms, and even within the languages belonging to the same language group, there is wide disparity. We analyse the impact of family, geographical location, GDP and the speaker population of languages and provide possible reasons for this disparity, along with some suggestions to overcome the same.

Submitted 19 October, 2022; v1 submitted 16 October, 2022; originally announced October 2022.

arXiv:2202.03457 [pdf, other]

doi 10.1109/MERCon55799.2022.9906176

Selecting Seed Words for Wordle using Character Statistics

Authors: Nisansa de Silva

Abstract: Wordle, a word-guessing game, rose to global popularity in January 2022. The goal of the game is to guess a five-letter English word within six tries. Each try provides the player with hints by means of colour-changing tiles which inform whether or not a given character is part of the solution as well as, in cases where it is part of the solution, whether or not it is in the correct placement. Numerous attempts have been made to find the best starting word and best strategy to solve the daily Wordle. This study uses character statistics of five-letter words to determine the best three starting words.

Submitted 6 February, 2024; v1 submitted 7 February, 2022; originally announced February 2022.

arXiv:2201.03941 [pdf, other]

doi 10.1145/3512826.3512829

Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data

Authors: Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, Yudhanjaya Wijeratne

Abstract: The relationship between Facebook posts and the corresponding reaction feature is an interesting subject to explore and understand. To achieve this end, we test state-of-the-art Sinhala sentiment analysis models against a data set containing a decade's worth of Sinhala posts with millions of reactions. For the purpose of establishing benchmarks and with the goal of identifying the best model for Sinhala sentiment analysis, we also test, on the same data set configuration, other deep learning models catered for sentiment analysis. In this study we report that the 3 layer Bidirectional LSTM model achieves an F1 score of 84.58% for Sinhala sentiment analysis, surpassing the current state-of-the-art model, Capsule B, which only manages to get an F1 score of 82.04%. Further, since all the deep learning models show F1 scores above 75% we conclude that it is safe to claim that Facebook reactions are suitable to predict the sentiment of a text.

Submitted 13 January, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

Comments: 8 pages, LaTeX; typos corrected

arXiv:2112.00468 [pdf, ps, other]

doi 10.1109/ICter53630.2021.9774796

Seeking Sinhala Sentiment: Predicting Facebook Reactions of Sinhala Posts

Authors: Vihanga Jayawickrama, Gihan Weeraprameshwara, Nisansa de Silva, Yudhanjaya Wijeratne

Abstract: The Facebook network allows its users to record their reactions to text via a typology of emotions. This network, taken at scale, is therefore a prime data set of annotated sentiment data. This paper uses millions of such reactions, derived from a decade's worth of Facebook post data centred around a Sri Lankan context, to model an eye of the beholder approach to sentiment detection for online Sinhala textual content. Three different sentiment analysis models are built, taking into account a limited subset of reactions, all reactions, and another that derives a positive/negative star rating value. The efficacy of these models in capturing the reactions of the observers is then computed and discussed. The analysis reveals that binary classification of reactions, for Sinhala content, is significantly more accurate than the other approaches. Furthermore, the inclusion of the like reaction hinders the capability of accurately predicting other reactions.

Submitted 1 December, 2021; originally announced December 2021.

arXiv:2111.05721 [pdf, other]

Critical Sentence Identification in Legal Cases Using Multi-Class Classification

Authors: Sahan Jayasinghe, Lakith Rambukkanage, Ashan Silva, Nisansa de Silva, Amal Shehan Perera

Abstract: Inherently, the legal domain contains a vast amount of data in text format. Therefore it requires the application of Natural Language Processing (NLP) to cater to the analytically demanding needs of the domain. The advancement of NLP is spreading through various domains, such as the legal domain, in forms of practical applications and academic research. Identifying critical sentences, facts and arguments in a legal case is a tedious task for legal professionals. In this research we explore the usage of sentence embeddings for multi-class classification to identify critical sentences in a legal case, in the perspective of the main parties present in the case. In addition, a task-specific loss function is defined in order to improve the accuracy restricted by the straightforward use of categorical cross entropy loss.

Submitted 14 November, 2021; v1 submitted 10 November, 2021; originally announced November 2021.

arXiv:2103.12028 [pdf, other]

doi 10.1162/tacl_a_00447

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller , et al. (27 additional authors not shown)

Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

Comments: Accepted at TACL; pre-MIT Press publication version

Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

arXiv:2011.06326 [pdf, other]

doi 10.1109/ICIIS51140.2020.9342650

SigmaLaw-ABSA: Dataset for Aspect-Based Sentiment Analysis in Legal Opinion Texts

Authors: Chanika Ruchini Mudalige, Dilini Karunarathna, Isanka Rajapaksha, Nisansa de Silva, Gathika Ratnayaka, Amal Shehan Perera, Ramesh Pathirana

Abstract: Aspect-Based Sentiment Analysis (ABSA) has been prominent and ongoing research over many different domains, but it is not widely discussed in the legal domain. A number of publicly available datasets for a wide range of domains usually fulfill the needs of researchers to perform their studies in the field of ABSA. To the best of our knowledge, there is no publicly available dataset for the Aspect (Party) Based Sentiment Analysis for legal opinion texts. Therefore, creating a publicly available dataset for the research of ABSA for the legal domain can be considered as a task with significant importance. In this study, we introduce a manually annotated legal opinion text dataset (SigmaLaw-ABSA) intended towards facilitating researchers for ABSA tasks in the legal domain. SigmaLaw-ABSA consists of legal opinion texts in the English language which have been annotated by human judges. This study discusses the sub-tasks of ABSA relevant to the legal domain and how to use the dataset to perform them. This paper also describes the statistics of the dataset and as a baseline, we present some results on the performance of some existing deep learning based systems on the SigmaLaw-ABSA dataset.

Submitted 12 November, 2020; originally announced November 2020.

Comments: 6 pages, 2 figures, IEEE International Conference on Industrial and Information Systems(ICIIS) 2020

arXiv:2011.05675 [pdf, other]

doi 10.1109/ICTer51097.2020.9325435

Rule-Based Approach for Party-Based Sentiment Analysis in Legal Opinion Texts

Authors: Isanka Rajapaksha, Chanika Ruchini Mudalige, Dilini Karunarathna, Nisansa de Silva, Gathika Ratnayaka, Amal Shehan Perera

Abstract: A document which elaborates opinions and arguments related to previous court cases is known as a legal opinion text. Lawyers and legal officials have to spend considerable effort and time to obtain the required information manually from those documents when dealing with new legal cases. Hence, it provides much convenience to those individuals if there is a way to automate the process of extracting information from legal opinion texts. Party-based sentiment analysis will play a key role in the automation system by identifying opinion values with respect to each legal party in legal texts.

Submitted 13 November, 2020; v1 submitted 11 November, 2020; originally announced November 2020.

Comments: 2 pages, 1 figure, The 20th International Conference on Advances in ICT for Emerging Regions (ICTer2020)

arXiv:2011.00318 [pdf, ps, other]

Effective Approach to Develop a Sentiment Annotator For Legal Domain in a Low Resource Setting

Authors: Gathika Ratnayaka, Nisansa de Silva, Amal Shehan Perera, Ramesh Pathirana

Abstract: Analyzing the sentiments of legal opinions available in Legal Opinion Texts can facilitate several use cases such as legal judgement prediction, contradictory statements identification and party-based sentiment analysis. However, the task of developing a legal domain specific sentiment annotator is challenging due to resource constraints such as lack of domain specific labelled data and domain expertise. In this study, we propose novel techniques that can be used to develop a sentiment annotator for the legal domain while minimizing the need for manual annotations of data.

Submitted 31 October, 2020; originally announced November 2020.

arXiv:2007.07884 [pdf]

Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

Authors: Yudhanjaya Wijeratne, Nisansa de Silva

Abstract: This paper presents two colloquial Sinhala language corpora from the language efforts of the Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages, including politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from the larger. Both corpora have markers for their date of creation, page of origin, and content type.

Submitted 15 July, 2020; originally announced July 2020.

Comments: 10 pages; Github repo of data linked in summary

arXiv:1908.09775 [pdf, other]

Multi-Path Learnable Wavelet Neural Network for Image Classification

Authors: D. D. N. De Silva, H. W. M. K. Vithanage, K. S. D. Fernando, I. T. S. Piyatilake

Abstract: Despite the remarkable success of deep learning in pattern recognition, deep network models face the problem of training a large number of parameters. In this paper, we propose and evaluate a novel multi-path wavelet neural network architecture for image classification with far fewer trainable parameters. The model architecture consists of a multi-path layout with several levels of wavelet decompositions performed in parallel followed by fully connected layers. These decomposition operations comprise wavelet neurons with learnable parameters, which are updated during the training phase using the back-propagation algorithm. We evaluate the performance of the introduced network using common image datasets without data augmentation except for SVHN and compare the results with influential deep learning models. Our findings support the possibility of reducing the number of parameters significantly in deep neural networks without compromising their accuracy.

Submitted 26 August, 2019; originally announced August 2019.

arXiv:1906.02430 [pdf, other]

Shift-of-Perspective Identification Within Legal Cases

Authors: Gathika Ratnayaka, Thejan Rupasinghe, Nisansa de Silva, Viraj Salaka Gamage, Menuka Warushavithana, Amal Shehan Perera

Abstract: Arguments, counter-arguments, facts, and evidence obtained via documents related to previous court cases are of essential need for legal professionals. Therefore, the process of automatic information extraction from documents containing legal opinions related to court cases can be considered to be of significant importance. This study is focused on the identification of sentences in legal opinion texts which convey different perspectives on a certain topic or entity. We combined several approaches based on semantic analysis, open information extraction, and sentiment analysis to achieve our objective. Then, our methodology was evaluated with the help of human judges. The outcomes of the evaluation demonstrate that our system is successful in detecting situations where two sentences deliver different opinions on the same topic or entity. The proposed methodology can be used to facilitate other information extraction tasks related to the legal domain. One such task is the automated detection of counter arguments for a given argument. Another is the identification of opponent parties in a court case.

Submitted 17 August, 2019; v1 submitted 6 June, 2019; originally announced June 2019.

arXiv:1906.02358 [pdf, other]

Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

Authors: Nisansa de Silva

Abstract: Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv and perpetually update it periodically to reflect the advances made in the field.

Submitted 19 April, 2024; v1 submitted 5 June, 2019; originally announced June 2019.

arXiv:1903.03772 [pdf, ps, other]

Logic Rules Powered Knowledge Graph Embedding

Authors: Pengwei Wang, Dejing Dou, Fangzhao Wu, Nisansa de Silva, Lianwen Jin

Abstract: Large scale knowledge graph embedding has attracted much attention from both academia and industry in the field of Artificial Intelligence. However, most existing methods concentrate solely on fact triples contained in the given knowledge graph. Inspired by the fact that logic rules can provide a flexible and declarative language for expressing rich background knowledge, it is natural to integrate logic rules into knowledge graph embedding, to transfer human knowledge to entity and relation embedding, and strengthen the learning process. In this paper, we propose a novel logic rule-enhanced method which can be easily integrated with any translation based knowledge graph embedding model, such as TransE. We first introduce a method to automatically mine the logic rules and corresponding confidences from the triples. And then, to put both triples and mined logic rules within the same semantic space, all triples in the knowledge graph are represented as first-order logic. Finally, we define several operations on the first-order logic and minimize a global loss over both of the mined logic rules and the transformed first-order logics. We conduct extensive experiments for link prediction and triple classification on three datasets: WN18, FB166, and FB15K. Experiments show that the rule-enhanced method can significantly improve the performance of several baselines. The highlight of our model is that the filtered Hits@1, which is a pivotal evaluation in the knowledge inference task, has a significant improvement (up to 700% improvement).

Submitted 9 March, 2019; originally announced March 2019.

arXiv:1810.01912 [pdf, other]

Fast Approach to Build an Automatic Sentiment Annotator for Legal Domain using Transfer Learning

Authors: Viraj Gamage, Menuka Warushavithana, Nisansa de Silva, Amal Shehan Perera, Gathika Ratnayaka, Thejan Rupasinghe

Abstract: This study proposes a novel way of identifying the sentiment of the phrases used in the legal domain. The added complexity of the language used in law, and the inability of the existing systems to accurately predict the sentiments of words in law are the main motivations behind this study. This is a transfer learning approach, which can be used for other domain adaptation tasks as well. The proposed methodology achieves an improvement of over 6% compared to the source model's accuracy in the legal domain.

Submitted 3 October, 2018; originally announced October 2018.

Comments: 9 pages, 3 figures

arXiv:1809.03416 [pdf, ps, other]

doi 10.1109/ICTER.2018.8615485

Identifying Relationships Among Sentences in Court Case Transcripts Using Discourse Relations

Authors: Gathika Ratnayaka, Thejan Rupasinghe, Nisansa de Silva, Menuka Warushavithana, Viraj Gamage, Amal Shehan Perera

Abstract: Case Law has a significant impact on the proceedings of legal cases. Therefore, the information that can be obtained from previous court cases is valuable to lawyers and other legal officials when performing their duties. This paper describes a methodology of applying discourse relations between sentences when processing text documents related to the legal domain. In this study, we developed a mechanism to classify the relationships that can be observed among sentences in transcripts of United States court cases. First, we defined relationship types that can be observed between sentences in court case transcripts. Then we classified pairs of sentences according to the relationship type by combining a machine learning model and a rule-based approach. The results obtained through our system were evaluated using human judges. To the best of our knowledge, this is the first study where discourse relationships between sentences have been used to determine relationships among sentences in legal court case transcripts.

Submitted 14 September, 2018; v1 submitted 10 September, 2018; originally announced September 2018.

Comments: Conference: 2018 International Conference on Advances in ICT for Emerging Regions (ICTer)

arXiv:1809.00982 [pdf, other]

Wavelet based edge feature enhancement for convolutional neural networks

Authors: D. D. N. De Silva, S. Fernando, I. T. S. Piyatilake, A. V. S. Karunarathne

Abstract: Convolutional neural networks are able to perform a hierarchical learning process starting with local features. However, limited attention is paid to enhancing such elementary level features like edges. We propose and evaluate two wavelet-based edge feature enhancement methods to preprocess the input images to convolutional neural networks. The first method develops feature enhanced representations by decomposing the input images using wavelet transform and then performing a limited reconstruction. The second method develops such feature enhanced inputs to the network using local modulus maxima of wavelet coefficients. For each method, we have developed a new preprocessing layer implementing the proposed method and have appended it to the network architecture. Our empirical evaluations demonstrate that the proposed methods outperform the baselines and previously published work with significant accuracy gains.

Submitted 4 February, 2019; v1 submitted 29 August, 2018; originally announced September 2018.

arXiv:1808.01766 [pdf]

On Optimizing Deep Convolutional Neural Networks by Evolutionary Computing

Authors: M. U. B. Dias, D. D. N. De Silva, S. Fernando

Abstract: Optimization for deep networks is currently a very active area of research. As neural networks become deeper, manually optimizing the network becomes harder. Mini-batch normalization, identification of effective receptive fields, momentum updates, introduction of residual blocks, learning rate adoption, etc. have been proposed to speed up the rate of convergence in the manual training process while keeping the higher accuracy level. However, the problem of finding the optimal topological structure for a given problem is becoming a challenging task that needs to be addressed immediately. Few researchers have attempted to optimize the network structure using evolutionary computing approaches. Among them, few have successfully evolved networks with reinforcement learning and long-short-term memory. Very few have applied evolutionary programming to deep convolutional neural networks. These attempts mainly evolved the network structure and then subsequently optimized the hyper-parameters of the network. However, a mechanism to evolve the deep network structure under the techniques currently being practiced in the manual process is still absent. Incorporation of such techniques into the chromosome level of evolutionary computing can certainly take us to better topological deep structures. The paper concludes by identifying the gap between evolutionary based deep neural networks and deep neural networks. Further, it proposes some insights for optimizing deep neural networks using evolutionary computing techniques.

Submitted 6 August, 2018; originally announced August 2018.

arXiv:1805.10685 [pdf, other]

Legal Document Retrieval using Document Vector Embeddings and Deep Learning

Authors: Keet Sugathadasa, Buddhi Ayesha, Nisansa de Silva, Amal Shehan Perera, Vindula Jayawardana, Dimuthu Lakmal, Madhavi Perera

Abstract: Domain specific information retrieval process has been a prominent and ongoing research in the field of natural language processing. Many researchers have incorporated different techniques to overcome the technical and domain specificity and provide a mature model for various domains of interest. The main bottleneck in these studies is the heavy coupling of domain experts, which makes the entire process time consuming and cumbersome. In this study, we have developed three novel models which are compared against a golden standard generated via the online repositories provided, specifically for the legal domain. The three different models incorporated vector space representations of the legal domain, where document vector generation was done in two different mechanisms and as an ensemble of the above two. This study contains the research being carried out in the process of representing legal case documents into different vector spaces, whilst incorporating semantic word measures and natural language processing techniques. The ensemble model built in this study shows a significantly higher accuracy level, which indeed proves the need for incorporation of domain specific semantic similarity measures into the information retrieval process. This study also shows the impact of varying distribution of the word similarity measures, against varying document vector dimensions, which can lead to improvements in the process of legal information retrieval.

Submitted 27 May, 2018; originally announced May 2018.

arXiv:1709.02911 [pdf, other]

doi 10.1109/ICTER.2017.8257822

Semi-Supervised Instance Population of an Ontology using Word Vector Embeddings

Authors: Vindula Jayawardana, Dimuthu Lakmal, Nisansa de Silva, Amal Shehan Perera, Keet Sugathadasa, Buddhi Ayesha, Madhavi Perera

Abstract: In many modern day systems such as information extraction and knowledge management agents, ontologies play a vital role in maintaining the concept hierarchies of the selected domain. However, ontology population has become a problematic process due to its nature of heavy coupling with manual human intervention. With the use of word embeddings in the field of natural language processing, it became a popular topic due to its ability to cope with semantic sensitivity. Hence, in this study, we propose a novel way of semi-supervised ontology population through word embeddings as the basis. We built several models including traditional benchmark models and new types of models which are based on word embeddings. Finally, we ensemble them together to come up with a synergistic model with better accuracy. We demonstrate that our ensemble model can outperform the individual models.

Submitted 9 September, 2017; originally announced September 2017.

arXiv:1709.00013 [pdf, other]

Logical paradoxes in quantum computation

Authors: Nadish de Silva

Abstract: While quantum computers are expected to yield considerable advantages over classical devices, the precise features of quantum theory enabling these advantages remain unclear. Contextuality--the denial of a notion of classical physical reality--has emerged as a promising hypothesis. Magic states are quantum resources critical for practically achieving universal quantum computation. They exhibit the standard form of contextuality that is known to enable probabilistic advantages in a variety of computational and communicational tasks. Strong contextuality is an extremal form of contextuality describing systems that exhibit logically paradoxical behaviour. Here, we consider special magic states that deterministically enable quantum computation. After introducing number-theoretic techniques for constructing exotic quantum paradoxes, we present large classes of strongly contextual magic states that enable deterministic implementation of gates from the Clifford hierarchy. These surprising discoveries bolster a refinement of the resource theory of contextuality that emphasises the computational power of logical paradoxes.

Submitted 7 June, 2018; v1 submitted 31 August, 2017; originally announced September 2017.

Comments: To appear in the Proceedings of the Thirty-Third Annual ACM/IEEE Symposium on Logic in Computer Science (LICS 2018 - Oxford, UK)

arXiv:1706.02909 [pdf, other]

doi 10.1109/INTECH.2017.8102426

Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings

Authors: Vindula Jayawardana, Dimuthu Lakmal, Nisansa de Silva, Amal Shehan Perera, Keet Sugathadasa, Buddhi Ayesha

Abstract: Selecting a representative vector for a set of vectors is a very common requirement in many algorithmic tasks. Traditionally, the mean or median vector is selected. Ontology classes are sets of homogeneous instance objects that can be converted to a vector space by word vector embeddings. This study proposes a methodology to derive a representative vector for ontology classes whose instances were converted to the vector space. We start by deriving five candidate vectors which are then used to train a machine learning model that would calculate a representative vector for the class. We show that our methodology out-performs the traditional mean and median vector representations.

Submitted 7 June, 2017; originally announced June 2017.

arXiv:1706.01967 [pdf, other]

doi 10.1109/ICIINFS.2017.8300343

Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity

Authors: Keet Sugathadasa, Buddhi Ayesha, Nisansa de Silva, Amal Shehan Perera, Vindula Jayawardana, Dimuthu Lakmal, Madhavi Perera

Abstract: Semantic similarity measures are an important part in Natural Language Processing tasks. However Semantic similarity measures built for general use do not perform well within specific domains. Therefore in this study we introduce a domain specific semantic similarity measure that was created by the synergistic union of word2vec, a word embedding method that is used for semantic similarity calculation and lexicon based (lexical) semantic similarity methods. We prove that this proposed methodology out performs word embedding methods trained on generic corpus and methods trained on domain specific corpus but do not use lexical semantic similarity methods to augment the results. Further, we prove that text lemmatization can improve the performance of word embedding methods.

Submitted 8 June, 2017; v1 submitted 6 June, 2017; originally announced June 2017.

Comments: 6 Pages, 3 figures

arXiv:1705.09995 [pdf]

Subject Specific Stream Classification Preprocessing Algorithm for Twitter Data Stream

Authors: Nisansa de Silva, Danaja Maldeniya, Chamilka Wijeratne

Abstract: Micro-blogging service Twitter is a lucrative source for data mining applications on global sentiment. But due to the omnifariousness of the subjects mentioned in each data item; it is inefficient to run a data mining algorithm on the raw data. This paper discusses an algorithm to accurately classify the entire stream in to a given number of mutually exclusive collectively exhaustive streams upon each of which the data mining algorithm can be run separately yielding more relevant results with a high efficiency.

Submitted 28 May, 2017; originally announced May 2017.

Comments: 6 pages

arXiv:1705.07310 [pdf, ps, other]

doi 10.4230/LIPIcs.MFCS.2017.35

The Quantum Monad on Relational Structures

Authors: Samson Abramsky, Rui Soares Barbosa, Nadish de Silva, Octavio Zapata

Abstract: Homomorphisms between relational structures play a central role in finite model theory, constraint satisfaction and database theory. A central theme in quantum computation is to show how quantum resources can be used to gain advantage in information processing tasks. In particular, non-local games have been used to exhibit quantum advantage in boolean constraint satisfaction, and to obtain quantum versions of graph invariants such as the chromatic number. We show how quantum strategies for homomorphism games between relational structures can be viewed as Kleisli morphisms for a quantum monad on the (classical) category of relational structures and homomorphisms. We show a general connection between these notions and state-independent quantum realizations of strong contextuality in the Abramsky-Brandenburger formulation of contextuality. We use these results to exhibit a wide range of examples of contextuality-powered quantum advantage, and to unify several apparently diverse strands of previous work.

Submitted 20 May, 2017; originally announced May 2017.

Comments: 20 pages

Journal ref: 42nd International Symposium on Mathematical Foundations of Computer Science (MFCS 2017), Leibniz International Proceedings in Informatics (LIPIcs) 83: 35:1--35:19, 2017
