Search | arXiv e-print repository

On Zero-Shot Counterspeech Generation by LLMs

Authors: Punyajoy Saha, Aalok Agrawal, Abhik Jana, Chris Biemann, Animesh Mukherjee

Abstract: With the emergence of numerous Large Language Models (LLMs), the usage of such models in various Natural Language Processing (NLP) applications is increasing extensively. Counterspeech generation is one such key task where efforts are made to develop generative models by fine-tuning LLMs with hatespeech-counterspeech pairs, but none of these attempts explores the intrinsic properties of large language models in zero-shot settings. In this work, we present a comprehensive analysis of the performance of four LLMs, namely GPT-2, DialoGPT, ChatGPT and FlanT5, in zero-shot settings for counterspeech generation, which is the first of its kind. For GPT-2 and DialoGPT, we further investigate the deviation in performance with respect to the sizes (small, medium, large) of the models. On the other hand, we propose three different prompting strategies for generating different types of counterspeech and analyse the impact of such strategies on the performance of the models. Our analysis shows that there is an improvement in generation quality for two datasets (17%); however, toxicity also increases (25%) with an increase in model size. Considering the type of model, GPT-2 and FlanT5 models are significantly better in terms of counterspeech quality but also have high toxicity as compared to DialoGPT. ChatGPT is much better at generating counterspeech than the other models across all metrics. In terms of prompting, we find that our proposed strategies help in improving counterspeech generation across all the models.

Submitted 22 March, 2024; originally announced March 2024.

Comments: 12 pages, 7 tables, accepted at LREC-COLING 2024

arXiv:2305.00244 [pdf, other]

A Critical Analysis of the Limitation of Deep Learning based 3D Dental Mesh Segmentation Methods in Segmenting Partial Scans

Authors: Ananya Jana, Aniruddha Maiti, Dimitris N. Metaxas

Abstract: Tooth segmentation from intraoral scans is a crucial part of digital dentistry. Many Deep Learning based tooth segmentation algorithms have been developed for this task. In most cases, high accuracy has been achieved, although most of the available tooth segmentation techniques make an implicit restrictive assumption of a full jaw model and report accuracy based on full jaw models. Medically, however, in certain cases, a full jaw tooth scan is not required or may not be available. Given this practical issue, it is important to understand the robustness of currently available, widely used Deep Learning based tooth segmentation techniques. For this purpose, we applied available segmentation techniques to partial intraoral scans and discovered that the available Deep Learning techniques under-perform drastically. The analysis and comparison presented in this work would help us understand the severity of the problem and allow us to develop robust tooth segmentation techniques without the strong assumption of a full jaw model.

Submitted 29 April, 2023; originally announced May 2023.

Comments: accepted to IEEE EMBC 2023

arXiv:2302.12039 [pdf, other]

Natural Language Processing in the Legal Domain

Authors: Daniel Martin Katz, Dirk Hartung, Lauritz Gerlach, Abhik Jana, Michael J. Bommarito II

Abstract: In this paper, we summarize the current state of the field of NLP & Law with a specific focus on recent technical and substantive developments. To support our analysis, we construct and analyze a nearly complete corpus of more than six hundred NLP & Law related papers published over the past decade. Our analysis highlights several major trends. Namely, we document an increasing number of papers written, tasks undertaken, and languages covered over the course of the past decade. We observe an increase in the sophistication of the methods which researchers deployed in this applied context. Slowly but surely, Legal NLP is beginning to match not only the methodological sophistication of general NLP but also the professional standards of data availability and code reproducibility observed within the broader scientific community. We believe all of these trends bode well for the future of the field, but many questions in both the academic and commercial spheres still remain open.

Submitted 23 February, 2023; originally announced February 2023.

Comments: 13 pages, 7 figures, 2 tables, online source and data

arXiv:2301.10531 [pdf, other]

3D Tooth Mesh Segmentation with Simplified Mesh Cell Representation

Authors: Ananya Jana, Hrebesh Molly Subhash, Dimitris N. Metaxas

Abstract: Manual tooth segmentation of 3D tooth meshes is tedious and there are variations among dentists. Several deep learning based methods have been proposed to perform automatic tooth mesh segmentation. Many of the proposed tooth mesh segmentation algorithms summarize the mesh cell as the cell center or barycenter, the normal at the barycenter, the cell vertices and the normals at the cell vertices. Summarizing the mesh cell/triangle in this manner imposes an implicit structural constraint and makes it difficult to work with multiple resolutions, which is done in many point cloud based deep learning algorithms. We propose a novel segmentation method which utilizes only the barycenter and the normal at the barycenter information of the mesh cell and yet achieves competitive performance. We are the first to demonstrate that it is possible to relax the implicit structural constraint and yet achieve superior segmentation performance.

Submitted 25 January, 2023; originally announced January 2023.

Comments: accepted at IEEE ISBI 2023 International Symposium on Biomedical Imaging

arXiv:2209.08132 [pdf, other]

Automatic Tooth Segmentation from 3D Dental Model using Deep Learning: A Quantitative Analysis of what can be learnt from a Single 3D Dental Model

Authors: Ananya Jana, Hrebesh Molly Subhash, Dimitris Metaxas

Abstract: 3D tooth segmentation is an important task for digital orthodontics. Several Deep Learning methods have been proposed for automatic tooth segmentation from 3D dental models or intraoral scans. These methods require annotated 3D intraoral scans. Manually annotating 3D intraoral scans is a laborious task. One approach is to devise self-supervision methods to reduce the manual labeling effort. Compared to other types of point cloud data like scene point cloud or shape point cloud data, 3D tooth point cloud data has a very regular structure and a strong shape prior. We look at how much representative information can be learnt from a single 3D intraoral scan. We evaluate this quantitatively with the help of ten different methods, of which six are generic point cloud segmentation methods whereas the other four are tooth segmentation specific methods. Surprisingly, we find that with training on a single 3D intraoral scan, the Dice score can be as high as 0.86, whereas the full training set gives a Dice score of 0.94. We conclude that the segmentation methods can learn a great deal of information from a single 3D tooth point cloud scan under suitable conditions, e.g., data augmentation. We are the first to quantitatively evaluate and demonstrate the representation learning capability of Deep Learning methods from a single 3D intraoral scan. This can enable building self-supervision methods for tooth segmentation under extreme data limitation scenarios by leveraging the available data to the fullest possible extent.

Submitted 16 September, 2022; originally announced September 2022.

Comments: accepted to SIPAIM 2022
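
The Dice scores quoted in the abstract above (0.86 versus 0.94) measure overlap between predicted and ground-truth labels. The following is a minimal sketch of the standard per-label Dice computation, assuming per-point integer labels stored in plain lists; the function name `dice_score` and the toy inputs are illustrative assumptions and are not taken from the paper.

```python
def dice_score(pred, truth, label):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for the points that carry `label`.

    `pred` and `truth` are equal-length sequences of per-point labels
    (hypothetical representation, chosen only for this sketch).
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    pred_pos = sum(1 for p in pred if p == label)
    truth_pos = sum(1 for t in truth if t == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if pred_pos + truth_pos == 0:
        # Label absent from both: treat as perfect agreement (a convention).
        return 1.0
    return 2.0 * overlap / (pred_pos + truth_pos)


# Toy example: 4 points, one tooth label (1) against background (0).
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0], label=1)
```

In practice a segmentation benchmark would typically average this value over all tooth labels and all test scans; how the paper aggregates is not stated in the abstract.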

arXiv:2110.00976 [pdf, other]

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

Authors: Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, Nikolaos Aletras

Abstract: Laws and their interpretations, legal arguments and agreements are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models, demonstrating that the latter consistently offer performance improvements across multiple tasks.

Submitted 8 November, 2022; v1 submitted 3 October, 2021; originally announced October 2021.

Comments: 9 pages, long paper at ACL 2022 proceedings. LexGLUE benchmark is available at: https://huggingface.co/datasets/lex_glue. Code is available at: https://github.com/coastalcph/lex-glue. Update TFIDF-SVM scores in the last version

arXiv:2109.05087 [pdf, other]

Global and Local Interpretation of black-box Machine Learning models to determine prognostic factors from early COVID-19 data

Authors: Ananya Jana, Carlos D. Minacapelli, Vinod Rustgi, Dimitris Metaxas

Abstract: The COVID-19 coronavirus has claimed 4.1 million lives, as of July 24, 2021. A variety of machine learning models have been applied to related data to predict important factors such as the severity of the disease and the infection rate, and to discover important prognostic factors. Often the usefulness of the findings from the use of these techniques is reduced due to a lack of method interpretability. Some recent progress made on the interpretability of machine learning models has the potential to unravel more insights while using conventional machine learning models. In this work, we analyze COVID-19 blood work data with some of the popular machine learning models; then we apply state-of-the-art post-hoc local interpretability techniques (e.g., SHAP, LIME) and global interpretability techniques (e.g., symbolic metamodeling) to the trained black-box models to draw interpretable conclusions. In the gamut of machine learning algorithms, regressions remain one of the simplest and most explainable models with a clear mathematical formulation. We explore one of the most recent techniques, called symbolic metamodeling, to find the mathematical expression of the machine learning models for COVID-19. We identify Acute Kidney Injury (AKI), initial Albumin level (ALBI), Aspartate aminotransferase (ASTI), Total Bilirubin initial (TBILI) and D-Dimer initial (DIMER) as major prognostic factors of the disease severity.
Our contributions are: (i) we uncover the underlying mathematical expression for the black-box models on the COVID-19 severity prediction task, (ii) we are the first to apply symbolic metamodeling to this task, and (iii) we discover important features and feature interactions.

Submitted 10 September, 2021; originally announced September 2021.

Comments: accepted by SIPAIM 2021, code repository: https://github.com/ananyajana/interpretablecovid19

arXiv:2103.03761 [pdf, other]

Liver Fibrosis and NAS scoring from CT images using self-supervised learning and texture encoding

Authors: Ananya Jana, Hui Qu, Carlos D. Minacapelli, Carolyn Catalano, Vinod Rustgi, Dimitris Metaxas

Abstract: Non-alcoholic fatty liver disease (NAFLD) is one of the most common causes of chronic liver diseases (CLD), which can progress to liver cancer. The severity and treatment of NAFLD are determined by NAFLD Activity Scores (NAS) and liver fibrosis stage, which are usually obtained from liver biopsy. However, biopsy is invasive in nature and involves risk of procedural complications. Current methods to predict the fibrosis and NAS scores from noninvasive CT images rely heavily on either a large annotated dataset or transfer learning using pretrained networks. However, the availability of a large annotated dataset cannot always be ensured, and there can be domain shifts when using transfer learning. In this work, we propose a self-supervised learning method to address both problems. As NAFLD causes changes in the liver texture, we also propose to use texture encoded inputs to improve the performance of the model. Given a relatively small dataset with 30 patients, we employ a self-supervised network which achieves better performance than a network trained via transfer learning. The code is publicly available at https://github.com/ananyajana/fibrosis_code.

Submitted 15 March, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

Comments: 5 pages, 2 figures, accepted at ISBI 2021, code at this URL: https://github.com/ananyajana/fibrosis_code

arXiv:2009.10687 [pdf, other]

Deep Learning based NAS Score and Fibrosis Stage Prediction from CT and Pathology Data

Authors: Ananya Jana, Hui Qu, Puru Rattan, Carlos D. Minacapelli, Vinod Rustgi, Dimitris Metaxas

Abstract: Non-Alcoholic Fatty Liver Disease (NAFLD) is becoming increasingly prevalent in the world population. Without diagnosis at the right time, NAFLD can lead to non-alcoholic steatohepatitis (NASH) and subsequent liver damage. The diagnosis and treatment of NAFLD depend on the NAFLD activity score (NAS) and the liver fibrosis stage, which are usually evaluated from liver biopsies by pathologists. In this work, we propose a novel method to automatically predict NAS score and fibrosis stage from CT data that is non-invasive and inexpensive to obtain compared with liver biopsy. We also present a method to combine the information from CT and H&E stained pathology data to improve the performance of NAS score and fibrosis stage prediction when both types of data are available. This is of great value in assisting pathologists in the computer-aided diagnosis process. Experiments on a 30-patient dataset illustrate the effectiveness of our method.

Submitted 22 September, 2020; originally announced September 2020.

Comments: 6 pages, 3 figures. Accepted in IEEE BIBE 2020

arXiv:2003.08433 [pdf, other]

doi 10.56553/popets-2022-0100

Neural Fuzzy Extractors: A Secure Way to Use Artificial Neural Networks for Biometric User Authentication

Authors: Abhishek Jana, Bipin Paudel, Md Kamruzzaman Sarker, Monireh Ebrahimi, Pascal Hitzler, George T Amariucai

Abstract: Powered by new advances in sensor development and artificial intelligence, the decreasing cost of computation, and the pervasiveness of handheld computation devices, biometric user authentication (and identification) is rapidly becoming ubiquitous. Modern approaches to biometric authentication, based on sophisticated machine learning techniques, cannot avoid storing either trained-classifier details or explicit user biometric data, thus exposing users' credentials to falsification. In this paper, we introduce a secure way to handle user-specific information involved with the use of vector-space classifiers or artificial neural networks for biometric authentication. Our proposed architecture, called a Neural Fuzzy Extractor (NFE), allows the coupling of pre-existing classifiers with fuzzy extractors, through an artificial-neural-network-based buffer called an expander, with minimal or no performance degradation. The NFE thus offers all the performance advantages of modern deep-learning-based classifiers, and all the security of standard fuzzy extractors. We demonstrate the NFE retrofit to a classic artificial neural network for a simple scenario of fingerprint-based user authentication.

Submitted 18 December, 2023; v1 submitted 18 March, 2020; originally announced March 2020.

Comments: 8 pages, 5 figures

Journal ref: Proceedings on Privacy Enhancing Technologies, 2022, volume 4, pages 86-104

arXiv:2002.11506 [pdf, other]

Using Distributional Thesaurus Embedding for Co-hyponymy Detection

Authors: Abhik Jana, Nikhil Reddy Varimalla, Pawan Goyal

Abstract: Discriminating lexical relations among distributionally similar words has always been a challenge for the natural language processing (NLP) community. In this paper, we investigate whether the network embedding of a distributional thesaurus can be effectively utilized to detect co-hyponymy relations. By extensive experiments over three benchmark datasets, we show that the vector representation obtained by applying node2vec on a distributional thesaurus outperforms the state-of-the-art models for binary classification of co-hyponymy vs. hypernymy, as well as co-hyponymy vs. meronymy, by huge margins.

Submitted 24 February, 2020; originally announced February 2020.

Comments: Accepted in LREC 2020. arXiv admin note: text overlap with arXiv:1802.04609

arXiv:1909.09774 [pdf]

LULC classification methodology based on simple Convolutional Neural Network to map complex urban forms at finer scale: Evidence from Mumbai

Authors: Deepank Verma, Arnab Jana

Abstract: The satellite imagery classification task is fundamental to spatial knowledge discovery. Several image classification methods are used to create standardized Land use and Land cover (LULC) maps, which facilitate research on spatial and ecological processes and human activities. Local Climate Zones (LCZ) classification maps are an example of standardized maps which have been widely used to demarcate the homogeneity in built and natural character in cities. The LCZ classification scheme is primarily focused on urban climate-related research, in which 17 climate zones are mapped in a city area at 100-150 m spatial resolution. Each zone exhibits physical properties related to urban form and functions essential for thermal behavior studies. Extending this widely adopted approach to create LULC maps at finer resolution using the LCZ mapping scheme would benefit the allied domains of urban planning, transportation, and water resources management. This study proposes a novel solution to generate classification maps with a 10-band Sentinel-2B dataset and Convolutional Neural Networks (CNN) at 10 m spatial resolution. The classification benefits from the CNN's property of preserving local structures in the image datasets. The proposed CNN model outperforms traditional machine learning models such as Artificial Neural Networks, Random Forests, and Support Vector Machines. The overall accuracy and kappa statistic of the CNN model trained on 14 urban and natural classes are 82 percent and 0.81, respectively.
The study also discusses the utility of the model for specialized remote sensing tasks such as change detection, identification of slum settlements, and mapping pervious/impervious layers in urban settlements with higher accuracy.

Submitted 1 May, 2020; v1 submitted 21 September, 2019; originally announced September 2019.

Comments: 28 pages, 9 figures

arXiv:1909.00160 [pdf, other]

doi 10.18653/v1/D19-1631

Incorporating Domain Knowledge into Medical NLI using Knowledge Graphs

Authors: Soumya Sharma, Bishal Santra, Abhik Jana, T. Y. S. S. Santosh, Niloy Ganguly, Pawan Goyal

Abstract: Recently, biomedical versions of embeddings obtained from language models such as BioELMo have shown state-of-the-art results for the textual inference task in the medical domain. In this paper, we explore how to incorporate structured domain knowledge, available in the form of a knowledge graph (UMLS), for the Medical NLI task. Specifically, we experiment with fusing embeddings obtained from the knowledge graph with the state-of-the-art approaches for the NLI task (ESIM model). We also experiment with fusing the domain-specific sentiment information for the task. Experiments conducted on the MedNLI dataset clearly show that this strategy improves the baseline BioELMo architecture for the Medical NLI task.

Submitted 31 August, 2019; originally announced September 2019.

Comments: EMNLP 2019 accepted short paper

arXiv:1906.03007 [pdf, ps, other]

On the Compositionality Prediction of Noun Phrases using Poincaré Embeddings

Authors: Abhik Jana, Dmitry Puzyrev, Alexander Panchenko, Pawan Goyal, Chris Biemann, Animesh Mukherjee

Abstract: The compositionality degree of multiword expressions indicates to what extent the meaning of a phrase can be derived from the meaning of its constituents and their grammatical relations. Prediction of (non)-compositionality is a task that has been frequently addressed with distributional semantic models. We introduce a novel technique to blend hierarchical information with distributional information for predicting compositionality. In particular, we use hypernymy information of the multiword and its constituents encoded in the form of the recently introduced Poincaré embeddings in addition to the distributional information to detect compositionality for noun phrases. Using a weighted average of the distributional similarity and a Poincaré similarity function, we obtain consistent and substantial, statistically significant improvement across three gold standard datasets over state-of-the-art models based on distributional information only. Unlike traditional approaches that solely use an unsupervised setting, we have also framed the problem as a supervised task, obtaining comparable improvements. Further, we publicly release our Poincaré embeddings, which are trained on the output of handcrafted lexical-syntactic patterns on a large corpus.

Submitted 7 June, 2019; originally announced June 2019.

Comments: Accepted in ACL 2019 [Long Paper]
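
The abstract above describes scoring with a weighted average of a distributional similarity and a Poincaré similarity. The sketch below illustrates that general idea only. The Poincaré distance is the standard formula for the Poincaré ball model; the mapping from distance to similarity (`1 / (1 + d)`), the weight `alpha`, and all function names are assumptions made for illustration and may differ from what the paper actually uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors (distributional side)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def poincare_distance(u, v):
    """Distance in the Poincaré ball: arcosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    return math.acosh(1.0 + 2.0 * sq_dist / ((1.0 - sq_u) * (1.0 - sq_v)))


def poincare_similarity(u, v):
    """ASSUMPTION: turn the distance into a similarity in (0, 1] via 1 / (1 + d)."""
    return 1.0 / (1.0 + poincare_distance(u, v))


def combined_score(distributional_sim, poincare_sim, alpha=0.5):
    """Weighted average of the two similarities; `alpha` is a hypothetical weight."""
    return alpha * distributional_sim + (1.0 - alpha) * poincare_sim
```

A phrase would then be scored by computing both similarities between the phrase and its constituents and passing them to `combined_score`; the weight would normally be tuned on held-out data.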

arXiv:1812.05936 [pdf, other]

Detecting Reliable Novel Word Senses: A Network-Centric Approach

Authors: Abhik Jana, Animesh Mukherjee, Pawan Goyal

Abstract: In this era of Big Data, due to the expeditious exchange of information on the web, words are being used to denote newer meanings, causing linguistic shift. With the recent availability of large amounts of digitized texts, an automated analysis of the evolution of language has become possible. Our study mainly focuses on improving the detection of new word senses. This paper presents a unique proposal based on network features to improve the precision of new word sense detection. For a candidate word where a new sense (birth) has been detected by comparing the sense clusters induced at two different time points, we further compare the network properties of the subgraphs induced from the novel sense cluster across these two time points. Using the mean fractional change in edge density, structural similarity and average path length as features in an SVM classifier, manual evaluation gives precision values of 0.86 and 0.74 for the task of new sense detection when tested on 2 distinct time-point pairs, in comparison to precision values in the range of 0.23-0.32 when the proposed scheme is not used. The outlined method can therefore be used as a new post-hoc step to improve the precision of novel word sense detection in a robust and reliable way where the underlying framework uses a graph structure. Another important observation is that even though our proposal is a post-hoc step, it can be used in isolation, and that itself results in a very decent performance, achieving a precision of 0.54-0.62.
Finally, we show that our method is able to detect the well-known historical shifts in 80% of cases.

Submitted 14 December, 2018; originally announced December 2018.

arXiv:1806.04092 [pdf, other]

WikiRef: Wikilinks as a route to recommending appropriate references for scientific Wikipedia pages

Authors: Abhik Jana, Pranjal Kanojiya, Pawan Goyal, Animesh Mukherjee

Abstract: The exponential increase in the usage of Wikipedia as a key source of scientific knowledge among researchers is making it absolutely necessary to metamorphose this knowledge repository into an integral and self-contained source of information for direct utilization. Unfortunately, the references which support the content of each Wikipedia entity page are far from complete. Why is the reference section ill-formed for most Wikipedia pages? Is this section edited as frequently as the other sections of a page? Can there be appropriate surrogates that can automatically enhance the reference section? In this paper, we propose a novel two-step approach -- WikiRef -- that (i) leverages the wikilinks present in a scientific Wikipedia target page and, thereby, (ii) recommends highly relevant references to be included in that target page, appropriately and automatically borrowed from the reference section of the wikilinks. In the first step, we build a classifier to ascertain whether a wikilink is a potential source of reference or not. In the following step, we recommend references to the target page from the reference section of the wikilinks that are classified as potential sources of references in the first step. We perform an extensive evaluation of our approach on datasets from two different domains -- Computer Science and Physics. For Computer Science we achieve a notably good performance with a precision@1 of 0.44 for reference recommendation, as opposed to 0.38 obtained from the most competitive baseline.
For the Physics dataset, we obtain a similar performance boost of 10% with respect to the most competitive baseline.

Submitted 15 June, 2018; v1 submitted 11 June, 2018; originally announced June 2018.

arXiv:1802.06196 [pdf, other]

Can Network Embedding of Distributional Thesaurus be Combined with Word Vectors for Better Representation?

Authors: Abhik Jana, Pawan Goyal

Abstract: Distributed representations of words learned from text have proved to be successful in various natural language processing tasks in recent times. While some methods represent words as vectors computed from text using a predictive model (Word2vec) or a dense count-based model (GloVe), others attempt to represent these in a distributional thesaurus network structure where the neighborhood of a word is a set of words having adequate context overlap. Motivated by the recent surge of research in network embedding techniques (DeepWalk, LINE, node2vec etc.), we turn a distributional thesaurus network into dense word vectors and investigate the usefulness of distributional thesaurus embedding in improving overall word representation. This is the first attempt where we show that combining the proposed word representation obtained by distributional thesaurus embedding with the state-of-the-art word representations helps in improving the performance by a significant margin when evaluated against NLP tasks like word similarity and relatedness, synonym detection, and analogy detection. Additionally, we show that even without using any handcrafted lexical resources we can come up with representations having comparable performance in the word similarity and relatedness tasks compared to the representations where a lexical resource has been used.

Submitted 17 February, 2018; originally announced February 2018.

arXiv:1802.04609 [pdf, other]

Network Features Based Co-hyponymy Detection

Authors: Abhik Jana, Pawan Goyal

Abstract: Distinguishing lexical relations has been a long-term pursuit in the natural language processing (NLP) domain. Recently, in order to detect lexical relations like hypernymy, meronymy, co-hyponymy etc., distributional semantic models are being used extensively in some form or the other. Even though a lot of effort has been made to detect the hypernymy relation, the problem of co-hyponymy detection has been rarely investigated. In this paper, we propose a novel supervised model where various network measures have been utilized to identify the co-hyponymy relation with high accuracy, performing better than or on par with the state-of-the-art models.

Submitted 13 February, 2018; originally announced February 2018.

arXiv:1710.05246 [pdf]

Shared High Value Research Resources: The CamCAN Human Lifespan Neuroimaging Dataset Processed on the Open Science Grid

Authors: Don Krieger, Paul Shepard, Ben Zusman, Anirban Jana, David O. Okonkwo

Abstract: The CamCAN Lifespan Neuroimaging Dataset, Cambridge (UK) Centre for Ageing and Neuroscience, was acquired and processed beginning in December, 2016. The referee consensus solver deployed to the Open Science Grid was used for this task. The dataset includes demographic and screening measures, a high-resolution MRI scan of the brain, and whole-head magnetoencephalographic (MEG) recordings during eyes-closed rest (560 sec), a simple task (540 sec), and passive listening/viewing (140 sec). The data were collected from 619 neurologically normal individuals, ages 18-87. The processed results from the resting recordings are completed and available online. These constitute 1.7 TBytes of data including the location within the brain (1 mm resolution), time stamp (1 msec resolution), and 80 msec time course for each of 3.7 billion validated neuroelectric events, i.e. mean 6.1 million events for each of the 619 participants. The referee consensus solver provides high-yield (mean 11,000 neuroelectric currents/sec; standard deviation (sd): 3500/sec), high-confidence (p < 10^-12 for each identified current) measures of the neuroelectric currents whose magnetic fields are detected in the MEG recordings. We describe the solver, the implementation of the solver deployed on the Open Science Grid, the workflow management system, the opportunistic use of high performance computing (HPC) resources to add computing capacity to the Open Science Grid reserved for this project, and our initial findings from the recently completed processing of the resting recordings. This required 14 million core hours, i.e. 40 core hours per second of data.

Submitted 8 December, 2017; v1 submitted 14 October, 2017; originally announced October 2017.

Comments: 8 pages, 7 figures; Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine; Keynote to The International Workshop on High Throughput Computing in Bioinformatics and Biomedicine using the Open Science Grid

arXiv:1705.03264 [pdf, other]

WikiM: Metapaths based Wikification of Scientific Abstracts

Authors: Abhik Jana, Sruthi Mooriyath, Animesh Mukherjee, Pawan Goyal

Abstract: In order to disseminate the exponential extent of knowledge being produced in the form of scientific publications, it would be best to design mechanisms that connect it with the already existing rich repository of concepts -- Wikipedia. Not only does this make scientific reading simple and easy (by connecting the involved concepts used in the scientific articles to their Wikipedia explanations) but it also improves the overall quality of the article. In this paper, we present a novel metapath-based method, WikiM, to efficiently wikify scientific abstracts -- a topic that has been rarely investigated in the literature. One of the prime motivations for this work comes from the observation that wikified abstracts of scientific documents help a reader to decide better, in comparison to the plain abstracts, whether (s)he would be interested to read the full article. We perform mention extraction mostly through traditional tf-idf measures coupled with a set of smart filters. The entity linking heavily leverages the rich citation and author publication networks. Our observation is that various metapaths defined over these networks can significantly enhance the overall performance of the system. For mention extraction and entity linking, we outperform most of the competing state-of-the-art techniques by a large margin, arriving at precision values of 72.42% and 73.8% respectively over a dataset from the ACL Anthology Network. In order to establish the robustness of our scheme, we wikify three other datasets and get precision values of 63.41%-94.03% and 67.67%-73.29% respectively for the mention extraction and the entity linking phase.

Submitted 9 May, 2017; originally announced May 2017.

arXiv:1608.05368 [pdf, ps, other]

Scaling Bounded Model Checking By Transforming Programs With Arrays

Authors: Anushri Jana, Uday P. Khedker, Advaita Datar, R Venkatesh, C Niyas

Abstract: Bounded Model Checking is one of the most successful techniques for finding bugs in programs. However, for programs with loops iterating over large-sized arrays, bounded model checkers often exceed the limit of resources available to them. We present a transformation that enables bounded model checkers to verify a certain class of array properties. Our technique transforms an array-manipulating program in ANSI-C to an array-free and loop-free program. The transformed program can efficiently be verified by an off-the-shelf bounded model checker. Though the transformed program is, in general, an abstraction of the original program, we formally characterize the properties for which the transformation is precise. We demonstrate the applicability and usefulness of our technique on both industry code as well as academic benchmarks.

Submitted 17 August, 2016; originally announced August 2016.

Comments: Pre-proceedings paper presented at the 26th International Symposium on Logic-Based Program Synthesis and Transformation (LOPSTR 2016), Edinburgh, Scotland UK, 6-8 September 2016 (arXiv:1608.02534)

Report number: LOPSTR/2016/23

arXiv:1606.06974 [pdf, ps, other]

Scaling Bounded Model Checking By Transforming Programs With Arrays

Authors: Anushri Jana, Uday P. Khedker, Advaita Datar, R Venkatesh, C Niyas

Abstract: Bounded Model Checking is one of the most successful techniques for finding bugs in programs. However, model checkers are resource hungry and are often unable to verify programs with loops iterating over large arrays. We present a transformation that enables bounded model checkers to verify a certain class of array properties. Our technique transforms an array-manipulating (ANSI-C) program to an array-free and loop-free (ANSI-C) program, thereby reducing the resource requirements of a model checker significantly. Model checking of the transformed program using an off-the-shelf bounded model checker simulates the loop iterations efficiently. Thus, our transformed program is a sound abstraction of the original program and is also precise in a large number of cases - we formally characterize the class of programs for which it is guaranteed to be precise. We demonstrate the applicability and usefulness of our technique on both industry code as well as academic benchmarks.

Submitted 7 March, 2017; v1 submitted 22 June, 2016; originally announced June 2016.

Showing 1–22 of 22 results for author: Jana, A