-
Legal Document Retrieval using Document Vector Embeddings and Deep Learning
Authors:
Keet Sugathadasa,
Buddhi Ayesha,
Nisansa de Silva,
Amal Shehan Perera,
Vindula Jayawardana,
Dimuthu Lakmal,
Madhavi Perera
Abstract:
Domain-specific information retrieval has been a prominent and ongoing research topic in the field of natural language processing. Many researchers have incorporated different techniques to overcome technical and domain specificity and provide a mature model for various domains of interest. The main bottleneck in these studies is the heavy reliance on domain experts, which makes the entire process time-consuming and cumbersome. In this study, we have developed three novel models, which are compared against a gold standard generated via the online repositories provided, specifically for the legal domain. The three models incorporate vector space representations of the legal domain, where document vector generation was done by two different mechanisms and by an ensemble of the two. This study covers the research carried out in representing legal case documents in different vector spaces, whilst incorporating semantic word measures and natural language processing techniques. The ensemble model built in this study shows a significantly higher accuracy level, which demonstrates the need to incorporate domain-specific semantic similarity measures into the information retrieval process. This study also shows the impact of varying the distribution of the word similarity measures against varying document vector dimensions, which can lead to improvements in legal information retrieval.
Submitted 27 May, 2018;
originally announced May 2018.
-
Semi-Supervised Instance Population of an Ontology using Word Vector Embeddings
Authors:
Vindula Jayawardana,
Dimuthu Lakmal,
Nisansa de Silva,
Amal Shehan Perera,
Keet Sugathadasa,
Buddhi Ayesha,
Madhavi Perera
Abstract:
In many modern-day systems such as information extraction and knowledge management agents, ontologies play a vital role in maintaining the concept hierarchies of the selected domain. However, ontology population has become a problematic process because it depends heavily on manual human intervention. Word embeddings have become a popular topic in the field of natural language processing because of their ability to cope with semantic sensitivity. Hence, in this study, we propose a novel method of semi-supervised ontology population that uses word embeddings as its basis. We built several models, including traditional benchmark models and new types of models based on word embeddings. Finally, we combine them into an ensemble to obtain a synergistic model with better accuracy. We demonstrate that our ensemble model can outperform the individual models.
Submitted 9 September, 2017;
originally announced September 2017.
-
Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings
Authors:
Vindula Jayawardana,
Dimuthu Lakmal,
Nisansa de Silva,
Amal Shehan Perera,
Keet Sugathadasa,
Buddhi Ayesha
Abstract:
Selecting a representative vector for a set of vectors is a very common requirement in many algorithmic tasks. Traditionally, the mean or median vector is selected. Ontology classes are sets of homogeneous instance objects that can be converted to a vector space by word vector embeddings. This study proposes a methodology to derive a representative vector for ontology classes whose instances were converted to the vector space. We start by deriving five candidate vectors, which are then used to train a machine learning model that calculates a representative vector for the class. We show that our methodology outperforms the traditional mean and median vector representations.
Submitted 7 June, 2017;
originally announced June 2017.
-
Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity
Authors:
Keet Sugathadasa,
Buddhi Ayesha,
Nisansa de Silva,
Amal Shehan Perera,
Vindula Jayawardana,
Dimuthu Lakmal,
Madhavi Perera
Abstract:
Semantic similarity measures are an important part of Natural Language Processing tasks. However, semantic similarity measures built for general use do not perform well within specific domains. Therefore, in this study, we introduce a domain-specific semantic similarity measure created by the synergistic union of word2vec, a word embedding method used for semantic similarity calculation, and lexicon-based (lexical) semantic similarity methods. We show that the proposed methodology outperforms both word embedding methods trained on a generic corpus and methods trained on a domain-specific corpus that do not use lexical semantic similarity methods to augment the results. Further, we show that text lemmatization can improve the performance of word embedding methods.
Submitted 8 June, 2017; v1 submitted 6 June, 2017;
originally announced June 2017.