-
Cross-Data Knowledge Graph Construction for LLM-enabled Educational Question-Answering System: A Case Study at HCMUT
Authors:
Tuan Bui,
Oanh Tran,
Phuong Nguyen,
Bao Ho,
Long Nguyen,
Thang Bui,
Tho Quan
Abstract:
In today's rapidly evolving landscape of Artificial Intelligence, large language models (LLMs) have emerged as a vibrant research topic. LLMs find applications in various fields and contribute significantly. Despite their powerful language capabilities, similar to pre-trained language models (PLMs), LLMs still face challenges in remembering events, incorporating new information, and addressing domain-specific issues or hallucinations. To overcome these limitations, some researchers have proposed Retrieval-Augmented Generation (RAG) techniques, while others have proposed integrating LLMs with Knowledge Graphs (KGs) to provide factual context, thereby improving performance and delivering more accurate feedback to user queries.
Education plays a crucial role in human development and progress. With the technology transformation, traditional education is being replaced by digital or blended education. Therefore, educational data in the digital environment is increasing day by day. Data in higher education institutions are diverse, comprising various sources such as unstructured/structured text, relational databases, web/app-based API access, etc. Constructing a Knowledge Graph from these cross-data sources is not a simple task. This article proposes a method for automatically constructing a Knowledge Graph from multiple data sources and discusses some initial applications (experimental trials) of KG in conjunction with LLMs for question-answering tasks.
Submitted 14 April, 2024;
originally announced April 2024.
-
Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models
Authors:
Sang T. Truong,
Duc Q. Nguyen,
Toan Nguyen,
Dong D. Le,
Nhi N. Truong,
Tho Quan,
Sanmi Koyejo
Abstract:
Recent advancements in large language models (LLMs) have underscored their importance in the evolution of artificial intelligence. However, despite extensive pretraining on multilingual datasets, available open-sourced LLMs exhibit limited effectiveness in processing Vietnamese. The challenge is exacerbated by the absence of systematic benchmark datasets and metrics tailored for Vietnamese LLM evaluation. To mitigate these issues, we have finetuned LLMs specifically for Vietnamese and developed a comprehensive evaluation framework encompassing 10 common tasks and 31 metrics. Our evaluation results reveal that the fine-tuned LLMs exhibit enhanced comprehension and generative capabilities in Vietnamese. Moreover, our analysis indicates that models with more parameters can introduce more biases and uncalibrated outputs and the key factor influencing LLM performance is the quality of the training or fine-tuning datasets. These insights underscore the significance of meticulous fine-tuning with high-quality datasets in enhancing LLM performance.
Submitted 26 May, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training
Authors:
Khoi M. Le,
Trinh Pham,
Tho Quan,
Anh Tuan Luu
Abstract:
Paraphrases are texts that convey the same meaning while using different words or sentence structures. It can be used as an automatic data augmentation tool for many Natural Language Processing tasks, especially when dealing with low-resource languages, where data shortage is a significant problem. To generate a paraphrase in multilingual settings, previous studies have leveraged the knowledge from the machine translation field, i.e., forming a paraphrase through zero-shot machine translation in the same language. Despite good performance on human evaluation, those methods still require parallel translation datasets, thus making them inapplicable to languages that do not have parallel corpora. To mitigate that problem, we proposed the first unsupervised multilingual paraphrasing model, LAMPAT ($\textbf{L}$ow-rank $\textbf{A}$daptation for $\textbf{M}$ultilingual $\textbf{P}$araphrasing using $\textbf{A}$dversarial $\textbf{T}$raining), by which monolingual dataset is sufficient enough to generate a human-like and diverse sentence. Throughout the experiments, we found out that our method not only works well for English but can generalize on unseen languages as well. Data and code are available at https://github.com/VinAIResearch/LAMPAT.
Submitted 23 June, 2024; v1 submitted 8 January, 2024;
originally announced January 2024.
-
xNeuSM: Explainable Neural Subgraph Matching with Graph Learnable Multi-hop Attention Networks
Authors:
Duc Q. Nguyen,
Thanh Toan Nguyen,
Tho Quan
Abstract:
Subgraph matching is a challenging problem with a wide range of applications in database systems, biochemistry, and cognitive science. It involves determining whether a given query graph is present within a larger target graph. Traditional graph-matching algorithms provide precise results but face challenges in large graph instances due to the NP-complete problem, limiting their practical applicability. In contrast, recent neural network-based approximations offer more scalable solutions, but often lack interpretable node correspondences. To address these limitations, this article presents xNeuSM: Explainable Neural Subgraph Matching which introduces Graph Learnable Multi-hop Attention Networks (GLeMA) that adaptively learns the parameters governing the attention factor decay for each node across hops rather than relying on fixed hyperparameters. We provide a theoretical analysis establishing error bounds for GLeMA's approximation of multi-hop attention as a function of the number of hops. Additionally, we prove that learning distinct attention decay factors for each node leads to a correct approximation of multi-hop attention. Empirical evaluation on real-world datasets shows that xNeuSM achieves substantial improvements in prediction accuracy of up to 34% compared to approximate baselines and, notably, at least a seven-fold faster query time than exact algorithms. The source code of our implementation is available at https://github.com/martinakaduc/xNeuSM.
Submitted 3 December, 2023;
originally announced December 2023.
-
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment
Authors:
Cong-Duy Nguyen,
The-Anh Vu-Le,
Thong Nguyen,
Tho Quan,
Luu Anh Tuan
Abstract:
Language models have been supervised with both language-only objective and visual grounding in existing studies of visual-grounded language learning. However, due to differences in the distribution and scale of visual-grounded datasets and language corpora, the language model tends to mix up the context of the tokens that occurred in the grounded data with those that do not. As a result, during representation learning, there is a mismatch between the visual information and the contextual meaning of the sentence. To overcome this limitation, we propose GroundedBERT - a grounded language learning method that enhances the BERT representation with visually grounded information. GroundedBERT comprises two components: (i) the original BERT which captures the contextual representation of words learned from the language corpora, and (ii) a visual grounding module which captures visual information learned from visual-grounded datasets. Moreover, we employ Optimal Transport (OT), specifically its partial variant, to solve the fractional alignment problem between the two modalities. Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.
Submitted 9 January, 2024; v1 submitted 3 December, 2023;
originally announced December 2023.
-
Generative Artificial Intelligence for Software Engineering -- A Research Agenda
Authors:
Anh Nguyen-Duc,
Beatriz Cabrero-Daniel,
Adam Przybylek,
Chetan Arora,
Dron Khanna,
Tomas Herda,
Usman Rafiq,
Jorge Melegati,
Eduardo Guerra,
Kai-Kristian Kemell,
Mika Saari,
Zheying Zhang,
Huy Le,
Tho Quan,
Pekka Abrahamsson
Abstract:
Generative Artificial Intelligence (GenAI) tools have become increasingly prevalent in software development, offering assistance to various managerial and technical project activities. Notable examples of these tools include OpenAI's ChatGPT, GitHub Copilot, and Amazon CodeWhisperer. Although many recent publications have explored and evaluated the application of GenAI, a comprehensive understanding of the current development, applications, limitations, and open challenges remains unclear to many. Particularly, we do not have an overall picture of the current state of GenAI technology in practical software engineering usage scenarios. We conducted a literature review and focus groups for a duration of five months to develop a research agenda on GenAI for Software Engineering. We identified 78 open Research Questions (RQs) in 11 areas of Software Engineering. Our results show that it is possible to explore the adoption of GenAI in partial automation and support decision-making in all software development activities. While the current literature is skewed toward software implementation, quality assurance and software maintenance, other areas, such as requirements engineering, software design, and software engineering education, would need further research attention. Common considerations when implementing GenAI include industry-level assessment, dependability and accuracy, data accessibility, transparency, and sustainability aspects associated with the technology. GenAI is bringing significant changes to the field of software engineering. Nevertheless, the state of research on the topic still remains immature. We believe that this research agenda holds significance and practical value for informing both researchers and practitioners about current applications and guiding future research.
Submitted 28 October, 2023;
originally announced October 2023.
-
ParamNet: A Dynamic Parameter Network for Fast Multi-to-One Stain Normalization
Authors:
Hongtao Kang,
Die Luo,
Li Chen,
Junbo Hu,
Tingwei Quan,
Shaoqun Zeng,
Shenghua Cheng,
Xiuli Liu
Abstract:
In practice, digital pathology images are often affected by various factors, resulting in very large differences in color and brightness. Stain normalization can effectively reduce the differences in color and brightness of digital pathology images, thus improving the performance of computer-aided diagnostic systems. Conventional stain normalization methods rely on one or several reference images, but one or several images may not adequately represent the entire dataset. Although learning-based stain normalization methods are a general approach, they use complex deep networks, which not only greatly reduce computational efficiency, but also risk introducing artifacts. Some studies use specialized network structures to enhance computational efficiency and reliability, but these methods are difficult to apply to multi-to-one stain normalization due to insufficient network capacity. In this study, we introduced dynamic-parameter network and proposed a novel method for stain normalization, called ParamNet. ParamNet addresses the challenges of limited network capacity and computational efficiency by introducing dynamic parameters (weights and biases of convolutional layers) into the network design. By effectively leveraging these parameters, ParamNet achieves superior performance in stain normalization while maintaining computational efficiency. Results show ParamNet can normalize one whole slide image (WSI) of 100,000x100,000 within 25s. The code is available at: https://github.com/khtao/ParamNet.
Submitted 16 July, 2024; v1 submitted 10 May, 2023;
originally announced May 2023.
-
Denoising Diffusion Medical Models
Authors:
Pham Ngoc Huy,
Tran Minh Quan
Abstract:
In this study, we introduce a generative model that can synthesize a large number of radiographical image/label pairs, and thus is asymptotically favorable to downstream activities such as segmentation in bio-medical image analysis. Denoising Diffusion Medical Model (DDMM), the proposed technique, can create realistic X-ray images and associated segmentations on a small number of annotated datasets as well as other massive unlabeled datasets with no supervision. Radiograph/segmentation pairs are generated jointly by the DDMM sampling process in probabilistic mode. As a result, a vanilla UNet that uses this data augmentation for segmentation task outperforms other similarly data-centric approaches.
Submitted 18 April, 2023;
originally announced April 2023.
-
Neural Radiance Projection
Authors:
Pham Ngoc Huy,
Tran Minh Quan
Abstract:
The proposed method, Neural Radiance Projection (NeRP), addresses the three most fundamental shortages of training such a convolutional neural network on X-ray image segmentation: dealing with missing/limited human-annotated datasets; ambiguity on the per-pixel label; and the imbalance across positive- and negative- classes distribution. By harnessing a generative adversarial network, we can synthesize a massive amount of physics-based X-ray images, so-called Variationally Reconstructed Radiographs (VRRs), alongside their segmentation from more accurate labeled 3D Computed Tomography data. As a result, VRRs present more faithfully than other projection methods in terms of photo-realistic metrics. Adding outputs from NeRP also surpasses the vanilla UNet models trained on the same pairs of X-ray images.
Submitted 15 March, 2022;
originally announced March 2022.
-
The Machine Learning for Combinatorial Optimization Competition (ML4CO): Results and Insights
Authors:
Maxime Gasse,
Quentin Cappart,
Jonas Charfreitag,
Laurent Charlin,
Didier Chételat,
Antonia Chmiela,
Justin Dumouchelle,
Ambros Gleixner,
Aleksandr M. Kazachkov,
Elias Khalil,
Pawel Lichocki,
Andrea Lodi,
Miles Lubin,
Chris J. Maddison,
Christopher Morris,
Dimitri J. Papageorgiou,
Augustin Parjadis,
Sebastian Pokutta,
Antoine Prouvost,
Lara Scavuzzo,
Giulia Zarpellon,
Linxin Yang,
Sha Lai,
Akang Wang,
Xiaodong Luo
, et al. (16 additional authors not shown)
Abstract:
Combinatorial optimization is a well-established area in operations research and computer science. Until recently, its methods have focused on solving problem instances in isolation, ignoring that they often stem from related data distributions in practice. However, recent years have seen a surge of interest in using machine learning as a new approach for solving combinatorial problems, either directly as solvers or by enhancing exact solvers. Based on this context, the ML4CO aims at improving state-of-the-art combinatorial optimization solvers by replacing key heuristic components. The competition featured three challenging tasks: finding the best feasible solution, producing the tightest optimality certificate, and giving an appropriate solver configuration. Three realistic datasets were considered: balanced item placement, workload apportionment, and maritime inventory routing. This last dataset was kept anonymous for the contestants.
Submitted 17 March, 2022; v1 submitted 4 March, 2022;
originally announced March 2022.
-
00
Authors:
Nguyen Thoi Minh Quan
Abstract:
What is the funniest number in cryptography (Episode 2)? 0 [1]. The reason is that $\forall x, x \cdot 0 = 0$, i.e., the equation is satisfied no matter what $x$ is. We'll use zero to attack zero-knowledge proof (ZKP). In particular, we'll discuss a critical issue in a cutting-edge ZKP PLONK [2] C++ implementation which allows an attacker to create a forged proof that all verifiers will accept. We'll show how theory guides the attack's direction. In practice, the attack works like a charm and we'll show how the attack falls through a chain of perfectly aligned software cracks. In the same codebase, there is an independent critical ECDSA bug where (r, s) = (0, 0) is a valid signature for arbitrary keys and messages, but we won't discuss it further because it's a known ECDSA attack vector in the Google Wycheproof cryptanalysis project [3] that I worked on a few years ago.
All bugs have been responsibly disclosed through the vendor's bug bounty program with total reward $\sim \$15,000$ (thank you).
Submitted 14 December, 2021;
originally announced January 2022.
-
Enriching and Controlling Global Semantics for Text Summarization
Authors:
Thong Nguyen,
Anh Tuan Luu,
Truc Lu,
Tho Quan
Abstract:
Recently, Transformer-based models have been proven effective in the abstractive summarization task by creating fluent and informative summaries. Nevertheless, these models still suffer from the short-range dependency problem, causing them to produce summaries that miss the key points of document. In this paper, we attempt to address this issue by introducing a neural topic model empowered with normalizing flow to capture the global semantics of the document, which are then integrated into the summarization model. In addition, to avoid the overwhelming effect of global semantics on contextualized representation, we introduce a mechanism to control the amount of global semantics supplied to the text generation module. Our method outperforms state-of-the-art summarization models on five common text summarization datasets, namely CNN/DailyMail, XSum, Reddit TIFU, arXiv, and PubMed.
Submitted 22 September, 2021;
originally announced September 2021.
-
StainNet: a fast and robust stain normalization network
Authors:
Hongtao Kang,
Die Luo,
Weihua Feng,
Junbo Hu,
Shaoqun Zeng,
Tingwei Quan,
Xiuli Liu
Abstract:
Stain normalization often refers to transferring the color distribution of the source image to that of the target image and has been widely used in biomedical image analysis. The conventional stain normalization is regarded as constructing a pixel-by-pixel color mapping model, which only depends on one reference image, and can not accurately achieve the style transformation between image datasets. In principle, this style transformation can be well solved by the deep learning-based methods due to its complicated network structure, whereas, its complicated structure results in the low computational efficiency and artifacts in the style transformation, which has restricted the practical application. Here, we use distillation learning to reduce the complexity of deep learning methods and a fast and robust network called StainNet to learn the color mapping between the source image and target image. StainNet can learn the color mapping relationship from a whole dataset and adjust the color value in a pixel-to-pixel manner. The pixel-to-pixel manner restricts the network size and avoids artifacts in the style transformation. The results on the cytopathology and histopathology datasets show that StainNet can achieve comparable performance to the deep learning-based methods. Computation results demonstrate StainNet is more than 40 times faster than StainGAN and can normalize a 100,000x100,000 whole slide image in 40 seconds.
Submitted 23 July, 2021; v1 submitted 23 December, 2020;
originally announced December 2020.
-
Reinforced Coloring for End-to-End Instance Segmentation
Authors:
Tuan Tran Anh,
Khoa Nguyen-Tuan,
Tran Minh Quan,
Won-Ki Jeong
Abstract:
Instance segmentation is one of the actively studied research topics in computer vision in which many objects of interest should be separated individually. While many feed-forward networks produce high-quality segmentation on different types of images, their results often suffer from topological errors (merging or splitting) for segmentation of many objects, requiring post-processing. Existing iterative methods, on the other hand, extract a single object at a time using discriminative knowledge-based properties (shapes, boundaries, etc.) without relying on post-processing, but they do not scale well. To exploit the advantages of conventional single-object-per-step segmentation methods without impairing the scalability, we propose a novel iterative deep reinforcement learning agent that learns how to differentiate multiple objects in parallel. Our reward function for the trainable agent is designed to favor grouping pixels belonging to the same object using a graph coloring algorithm. We demonstrate that the proposed method can efficiently perform instance segmentation of many objects without heavy post-processing.
Submitted 18 May, 2020; v1 submitted 14 May, 2020;
originally announced May 2020.
-
PageRank algorithm for Directed Hypergraph
Authors:
Loc Tran,
Tho Quan,
An Mai
Abstract:
Over the last two decades, the World Wide Web's link structure has commonly been modeled as a directed graph. In this paper, we model the World Wide Web's link structure as a directed hypergraph. Moreover, we develop a PageRank algorithm for this directed hypergraph. Due to the lack of World Wide Web directed hypergraph datasets, we apply the PageRank algorithm to a metabolic network, which is itself a directed hypergraph. The experiments show that our novel PageRank algorithm can be successfully applied to this metabolic network.
Submitted 6 September, 2022; v1 submitted 29 August, 2019;
originally announced September 2019.
-
Nested Variational Autoencoder for Topic Modeling on Microtexts with Word Vectors
Authors:
Trung Trinh,
Tho Quan,
Trung Mai
Abstract:
Most of the information on the Internet is represented in the form of microtexts, which are short text snippets such as news headlines or tweets. These sources of information are abundant, and mining these data could uncover meaningful insights. Topic modeling is one of the popular methods to extract knowledge from a collection of documents; however, conventional topic models such as latent Dirichlet allocation (LDA) are unable to perform well on short documents, mostly due to the scarcity of word co-occurrence statistics embedded in the data. The objective of our research is to create a topic model that can achieve great performances on microtexts while requiring a small runtime for scalability to large datasets. To solve the lack of information of microtexts, we allow our method to take advantage of word embeddings for additional knowledge of relationships between words. For speed and scalability, we apply autoencoding variational Bayes, an algorithm that can perform efficient black-box inference in probabilistic models. The result of our work is a novel topic model called the nested variational autoencoder, which is a distribution that takes into account word vectors and is parameterized by a neural network architecture. For optimization, the model is trained to approximate the posterior distribution of the original LDA model. Experiments show the improvements of our model on microtexts as well as its runtime advantage.
Submitted 15 September, 2019; v1 submitted 1 May, 2019;
originally announced May 2019.
-
Combination of Domain Knowledge and Deep Learning for Sentiment Analysis of Short and Informal Messages on Social Media
Authors:
Khuong Vo,
Tri Nguyen,
Dang Pham,
Mao Nguyen,
Minh Truong,
Trung Mai,
Tho Quan
Abstract:
Sentiment analysis has been emerging recently as one of the major natural language processing (NLP) tasks in many applications. Especially, as social media channels (e.g. social networks or forums) have become significant sources for brands to observe user opinions about their products, this task is thus increasingly crucial. However, when applied with real data obtained from social media, we notice that there is a high volume of short and informal messages posted by users on those channels. This kind of data makes the existing works suffer from many difficulties to handle, especially ones using deep learning approaches. In this paper, we propose an approach to handle this problem. This work is extended from our previous work, in which we proposed to combine the typical deep learning technique of Convolutional Neural Networks with domain knowledge. The combination is used for acquiring additional training data augmentation and a more reasonable loss function. In this work, we further improve our architecture by various substantial enhancements, including negation-based data augmentation, transfer learning for word embeddings, the combination of word-level embeddings and character-level embeddings, and using multitask learning technique for attaching domain knowledge rules in the learning process. Those enhancements, specifically aiming to handle short and informal messages, help us to enjoy significant improvement in performance once experimenting on real datasets.
Submitted 20 December, 2019; v1 submitted 16 February, 2019;
originally announced February 2019.
-
Towards Autoencoding Variational Inference for Aspect-based Opinion Summary
Authors:
Tai Hoang,
Huy Le,
Tho Quan
Abstract:
Aspect-based Opinion Summary (AOS), consisting of aspect discovery and sentiment classification steps, has recently emerged as one of the most crucial data mining tasks in e-commerce systems. Along this direction, the LDA-based model is considered a notably suitable approach, since it offers both topic modeling and sentiment classification. However, unlike traditional topic modeling, aspect discovery often requires some initial seed words, whose prior knowledge is not easy to incorporate into LDA models. Moreover, LDA approaches rely on sampling methods, which need to load the whole corpus into memory, making them hardly scalable. In this research, we study an alternative approach to the AOS problem, based on Autoencoding Variational Inference (AVI). First, we introduce the Autoencoding Variational Inference for Aspect Discovery (AVIAD) model, which extends the previous work of Autoencoding Variational Inference for Topic Models (AVITM) to embed prior knowledge of seed words. This work includes an enhancement of the previous AVI architecture and a modification of the loss function. Finally, we present the Autoencoding Variational Inference for Joint Sentiment/Topic (AVIJST) model. In this model, we substantially extend the AVI model to support the JST model, which performs topic modeling for corresponding sentiment. The experimental results show that our proposed models achieve higher topic coherence, faster convergence and better accuracy on sentiment classification, compared to their LDA-based counterparts.
Submitted 6 June, 2019; v1 submitted 7 February, 2019;
originally announced February 2019.
-
Semantic Document Clustering on Named Entity Features
Authors:
Tru H. Cao,
Vuong M. Ngo,
Dung T. Hong,
Tho T. Quan
Abstract:
Keyword-based information processing has limitations due to its simple treatment of words. In this paper, we introduce named entities, which are the key elements defining document semantics and in many cases are of concern to users, as objectives in document clustering. First, the traditional keyword-based vector space model is adapted with vectors defined over spaces of entity names, types, name-type pairs, and identifiers, instead of keywords. Then, hierarchical document clustering can be performed using a similarity measure defined as the cosine of the vectors representing documents. Experimental results are presented and discussed. Clustering documents by named entity information could be useful for managing web-based learning materials with respect to related objects.
Submitted 20 July, 2018;
originally announced July 2018.
-
Combination of Domain Knowledge and Deep Learning for Sentiment Analysis
Authors:
Khuong Vo,
Dang Pham,
Mao Nguyen,
Trung Mai,
Tho Quan
Abstract:
The emerging technique of deep learning has been widely applied in many different areas. However, when adopted in a specific domain, this technique should be combined with domain knowledge to improve efficiency and accuracy. In particular, when analyzing the applications of deep learning in sentiment analysis, we found that current approaches suffer from the following drawbacks: (i) existing works have not paid much attention to the importance of different types of sentiment terms, which is an important concept in this area; and (ii) the loss function currently employed does not reflect well the degree of error of sentiment misclassification. To overcome these problems, we propose to combine domain knowledge with deep learning. Our proposal includes using sentiment scores, learnt by quadratic programming, to augment training data; and introducing a penalty matrix to enhance the cross-entropy loss function. In our experiments, we achieved a significant improvement in classification results.
Submitted 15 February, 2019; v1 submitted 22 June, 2018;
originally announced June 2018.
-
Compressed Sensing MRI Reconstruction using a Generative Adversarial Network with a Cyclic Loss
Authors:
Tran Minh Quan,
Thanh Nguyen-Duc,
Won-Ki Jeong
Abstract:
Compressed Sensing MRI (CS-MRI) has provided theoretical foundations upon which the time-consuming MRI acquisition process can be accelerated. However, it primarily relies on iterative numerical solvers, which still hinders its adoption in time-critical applications. In addition, recent advances in deep neural networks have shown their potential in computer vision and image processing, but their adaptation to MRI reconstruction is still at an early stage. In this paper, we propose a novel deep learning-based generative adversarial model, RefineGAN, for fast and accurate CS-MRI reconstruction. The proposed model is a variant of fully-residual convolutional autoencoder and generative adversarial networks (GANs), specifically designed for the CS-MRI formulation; it employs deeper generator and discriminator networks with a cyclic data consistency loss for faithful interpolation in the given under-sampled k-space data. In addition, our solution leverages a chained network to further enhance the reconstruction quality. RefineGAN is fast and accurate: the reconstruction process is extremely rapid, as low as tens of milliseconds for reconstruction of a 256x256 image, because it is a one-way deployment on a feed-forward network, and the image quality is superior even for extremely low sampling rates (as low as 10%) due to the data-driven nature of the method. We demonstrate that RefineGAN outperforms the state-of-the-art CS-MRI methods by a large margin in terms of both running time and image quality via evaluation using several open-source MRI databases.
Submitted 15 March, 2018; v1 submitted 3 September, 2017;
originally announced September 2017.
-
FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics
Authors:
Tran Minh Quan,
David G. C. Hildebrand,
Won-Ki Jeong
Abstract:
Electron microscopic connectomics is an ambitious research direction with the goal of studying comprehensive brain connectivity maps by using high-throughput, nano-scale microscopy. One of the main challenges in connectomics research is developing scalable image analysis algorithms that require minimal user intervention. Recently, deep learning has drawn much attention in computer vision because of its exceptional performance in image classification tasks. For this reason, its application to connectomic analyses holds great promise as well. In this paper, we introduce a novel deep neural network architecture, FusionNet, for the automatic segmentation of neuronal structures in connectomics data. FusionNet leverages the latest advances in machine learning, such as semantic segmentation and residual neural networks, with the novel introduction of summation-based skip connections to allow a much deeper network architecture for more accurate segmentation. We demonstrate the performance of the proposed method by comparing it with state-of-the-art electron microscopy (EM) segmentation methods from the ISBI EM segmentation challenge. We also show the segmentation results on two different tasks, cell membrane and cell body segmentation, and a statistical analysis of cell morphology.
Submitted 26 December, 2016; v1 submitted 15 December, 2016;
originally announced December 2016.