-
Infusing clinical knowledge into tokenisers for language models
Authors:
Abul Hasan,
Jinge Wu,
Quang Ngoc Nguyen,
Salomé Andres,
Imane Guellil,
Huayu Zhang,
Arlene Casey,
Beatrice Alex,
Bruce Guthrie,
Honghan Wu
Abstract:
This study introduces a novel knowledge-enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at the initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology such as the Unified Medical Language System or the training data of the task-related corpus. At the training or inference stage, sentence-level localised context is used to choose the optimal global token representation and so realise semantic-based tokenisation. To avoid pretraining with the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets to evaluate K-Tokeniser in a wide range of clinical text analytics tasks, including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task, with a 13\% increase in Micro $F_1$ score. Furthermore, K-Tokeniser also shows significant capacity to facilitate quicker convergence of language models. Specifically, using K-Tokeniser, the language models require only 50\% of the training data to achieve the best performance of the baseline tokeniser using all training data in the concept extraction task, and less than 20\% of the data for the automated coding task. It is worth mentioning that all these improvements require no pre-training process, making the approach generalisable.
Submitted 20 June, 2024;
originally announced June 2024.
-
Consensus seeking in diffusive multidimensional networks with a repeated interaction pattern and time-delays
Authors:
Hoang Huy Vu,
Quyen Ngoc Nguyen,
Chuong Van Nguyen,
Tuynh Van Pham,
Minh Hoang Trinh
Abstract:
This paper studies a consensus problem in multidimensional networks having the same agent-to-agent interaction pattern under both intra- and cross-layer time delays. Several conditions for the agents to globally asymptotically achieve a consensus are derived, which involve the overall network's structure, the local interacting pattern, and the values of the time delays. The validity of these conditions is proved by direct eigenvalue evaluation and supported by numerical simulations.
Submitted 23 February, 2024;
originally announced February 2024.
-
Statistical Modeling of Data Breach Risks: Time to Identification and Notification
Authors:
Maochao Xu,
Quynh Nhu Nguyen
Abstract:
It is very challenging to predict the cost of a cyber incident owing to the complex nature of cyber risk. However, such prediction is unavoidable for insurance companies that offer cyber insurance policies. The time to identify an incident and the time to notify the affected individuals are two important components in determining the cost of a cyber incident. In this work, we initiate the study of those two metrics via statistical modeling approaches. In particular, we propose a novel approach to imputing the missing data, and further develop a dependence model to capture the complex pattern exhibited by those two metrics. The empirical study shows that the proposed approach has satisfactory predictive performance and is superior to other commonly used models.
Submitted 24 September, 2022; v1 submitted 15 September, 2022;
originally announced September 2022.
-
Integrated ICN and CDN Slice as a Service
Authors:
Ilias Benkacem,
M. Bagaa,
T. Taleb,
Q. N. Nguyen,
T. Tsuda,
T. Sato
Abstract:
In this article, we leverage Network Function Virtualization (NFV) and Multi-Access Edge Computing (MEC) technologies, proposing a system which integrates ICN (Information-Centric Network) with CDN (Content Delivery Network) to provide an efficient content delivery service. The proposed system combines the dynamic CDN slicing concept with the NDN (Named Data Network) based ICN slicing concept to avoid core network congestion. A dynamic CDN slice is deployed to cache content at optimal locations depending on the nature of the content and the geographical distribution of potential viewers. Virtual cache servers, along with supporting virtual transcoders, are placed across a cloud belonging to multiple administrative domains, forming a CDN slice. The ICN slice is, in turn, used for the regional distribution of content, leveraging name-based access and autonomic in-network content caching. This enables the delivery of content from nearby network nodes, avoiding the duplicate transfer of content and also ensuring shorter response times. Our experiments demonstrate that integrated ICN/CDN is better than traditional CDN in almost all aspects, including service scalability, reliability, and quality of service.
Submitted 3 January, 2022;
originally announced January 2022.
-
A local geometry of hyperedges in hypergraphs, and its applications to social networks
Authors:
Dong Quan Ngoc Nguyen,
Lin Xing
Abstract:
In many real-world datasets arising from social networks, there are hidden higher order relations among data points which cannot be captured using graph modeling. It is natural to use the more general notion of hypergraphs to model such social networks. In this paper, we introduce a new local geometry of hyperedges in hypergraphs which allows one to capture higher order relations among data points. Furthermore, based on this new geometry, we also introduce a new methodology, the nearest neighbors method in hypergraphs, for analyzing datasets arising from sociology.
Submitted 29 September, 2020;
originally announced October 2020.
-
Community detection, pattern recognition, and hypergraph-based learning: approaches using metric geometry and persistent homology
Authors:
Dong Quan Ngoc Nguyen,
Lin Xing,
Lizhen Lin
Abstract:
Hypergraph data appear, often hidden, in many places in the modern age. They are data structures that can be used to model many real data examples, since their structures contain information about higher order relations among data points. One of the main contributions of our paper is to introduce a new topological structure on hypergraph data which bears a resemblance to a usual metric space structure. Using this new topological space structure of hypergraph data, we propose several approaches to study the community detection problem, detecting persistent features arising from the homological structure of hypergraph data. Also based on the topological space structure of hypergraph data introduced in our paper, we introduce modified nearest neighbors methods, which generalize the classical nearest neighbors methods from machine learning. Our modified nearest neighbors methods have the advantage of being very flexible and applicable even to discrete structures such as hypergraphs. We then apply our modified nearest neighbors methods to study the sign prediction problem in hypergraph data constructed using our method.
Submitted 29 September, 2020;
originally announced October 2020.
-
Weight Prediction for Variants of Weighted Directed Networks
Authors:
Dong Quan Ngoc Nguyen,
Lin Xing,
Lizhen Lin
Abstract:
A weighted directed network (WDN) is a directed graph in which each edge is associated with a unique value called its weight. These networks are very suitable for modeling real-world social networks in which there is an assessment of one vertex toward other vertices. One of the main problems studied in this paper is the prediction of edge weights in such networks. We introduce, for the first time, a metric geometry approach to studying edge weight prediction in WDNs. We modify the usual notion of WDNs, and introduce a new type of WDN for which we coin the term \textit{almost-weighted directed networks} (AWDNs). AWDNs can capture the weight information of a network from a given training set. We then construct a class of metrics (or distances) for AWDNs which equips such networks with a metric space structure. Using the metric geometry structure of AWDNs, we propose modified $k$ nearest neighbors ($k$NN) methods and modified support-vector machine (SVM) methods which are then used to predict edge weights in AWDNs. In many real-world datasets, in addition to edge weights, one can also associate weights with vertices which capture information about the vertices; the association of weights with vertices plays an especially important role in graph embedding problems. Adopting a similar approach, we introduce two new types of directed networks in which weights are associated with either a subset of origin vertices or a subset of terminal vertices. We construct, for the first time, novel classes of metrics on such networks, and based on these new metrics propose modified $k$NN and SVM methods for predicting weights of origins and terminals in these networks. We provide experimental results on several real-world datasets, using our geometric methodologies.
Submitted 29 September, 2020;
originally announced September 2020.
-
Exploiting Direct and Indirect Information for Friend Suggestion in ZingMe
Authors:
Kien Duy Nguyen,
Tuan Pham Minh,
Quang Nhat Nguyen,
Thanh Trung Nguyen
Abstract:
Friend suggestion is a fundamental problem in social networks, with the goal of assisting users in creating more relationships and thereby enhancing users' interest in the social network. This problem is often treated as a link prediction problem in the network. ZingMe is one of the largest social networks in Vietnam. In this paper, we analyze the current approach to the friend suggestion problem in ZingMe, showing its limitations and disadvantages. We propose a new efficient approach for friend suggestion that uses information from the network structure, attributes, and interactions of users to create resources for evaluating friend connections amongst users. A friend connection is evaluated by exploiting both direct communication between the users and information from other users in the network. The proposed approach has been implemented in a new system version of ZingMe. We conducted experiments, exploiting a dataset derived from users' real use of ZingMe, to compare the newly proposed approach with the current approach and some well-known ones in terms of friend suggestion accuracy. The experimental results show that the newly proposed approach outperforms the current one, i.e., by an increase of 7% to 98% on average in friend suggestion accuracy. The proposed approach also outperforms the other approaches for users who have a small number of friends, with improvements from 20% to 85% on average. In this paper, we also discuss a number of open issues and possible improvements for the proposed approach.
Submitted 15 November, 2013;
originally announced November 2013.