Search | arXiv e-print repository

FDLoRA: Personalized Federated Learning of Large Language Model via Dual LoRA Tuning

Authors: Jiaxing Qi, Zhongzhi Luan, Shaohan Huang, Carol Fung, Hailong Yang, Depei Qian

Abstract: Large language models (LLMs) have emerged as important components across various fields, yet their training requires substantial computation resources and abundant labeled data. It poses a challenge to robustly training LLMs for individual users (clients). To tackle this challenge, the intuitive idea is to introduce federated learning (FL), which can collaboratively train models on distributed private data. However, existing methods suffer from the challenges of data heterogeneity, system heterogeneity, and model size, resulting in suboptimal performance and high costs. In this work, we proposed a variant of personalized federated learning (PFL) framework, namely FDLoRA, which allows the client to be a single device or a cluster and adopts low-rank adaptation (LoRA) tuning. FDLoRA sets dual LoRA modules on each client to capture personalized and global knowledge, respectively, and only the global LoRA module uploads parameters to the central server to aggregate cross-client knowledge. Finally, an adaptive fusion approach is employed to combine the parameters of the dual LoRAs. This enables FDLoRA to make effective use of private data distributed across different clients, thereby improving performance on the client without incurring high communication and computing costs. We conducted extensive experiments in two practice scenarios. The results demonstrate that FDLoRA outperforms six baselines in terms of performance, stability, robustness, computation cost, and communication cost.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2404.08562 [pdf, other]

Dynamic Neural Control Flow Execution: An Agent-Based Deep Equilibrium Approach for Binary Vulnerability Detection

Authors: Litao Li, Steven H. H. Ding, Andrew Walenstein, Philippe Charland, Benjamin C. M. Fung

Abstract: Software vulnerabilities are a challenge in cybersecurity. Manual security patches are often difficult and slow to be deployed, while new vulnerabilities are created. Binary code vulnerability detection is less studied and more complex compared to source code, and this has important practical implications. Deep learning has become an efficient and powerful tool in the security domain, where it provides end-to-end and accurate prediction. Modern deep learning approaches learn the program semantics through sequence and graph neural networks, using various intermediate representation of programs, such as abstract syntax trees (AST) or control flow graphs (CFG). Due to the complex nature of program execution, the output of an execution depends on the many program states and inputs. Also, a CFG generated from static analysis can be an overestimation of the true program flow. Moreover, the size of programs often does not allow a graph neural network with fixed layers to aggregate global information. To address these issues, we propose DeepEXE, an agent-based implicit neural network that mimics the execution path of a program. We use reinforcement learning to enhance the branching decision at every program state transition and create a dynamic environment to learn the dependency between a vulnerability and certain program states. An implicitly defined neural network enables nearly infinite state transitions until convergence, which captures the structural information at a higher level. The experiments are conducted on two semi-synthetic and two real-world datasets. We show that DeepEXE is an accurate and efficient method and outperforms the state-of-the-art vulnerability detection methods.

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2402.01905 [pdf, other]

Carthago Delenda Est: Co-opetitive Indirect Information Diffusion Model for Influence Operations on Online Social Media

Authors: Jwen Fai Low, Benjamin C. M. Fung, Farkhund Iqbal, Claude Fachkha

Abstract: For a state or non-state actor whose credibility is bankrupt, relying on bots to conduct non-attributable, non-accountable, and seemingly-grassroots-but-decentralized-in-actuality influence/information operations (info ops) on social media can help circumvent the issue of trust deficit while advancing its interests. Planning and/or defending against decentralized info ops can be aided by computational simulations in lieu of ethically-fraught live experiments on social media. In this study, we introduce Diluvsion, an agent-based model for contested information propagation efforts on Twitter-like social media. The model emphasizes a user's belief in an opinion (stance) being impacted by the perception of potentially illusory popular support from constant incoming floods of indirect information, floods that can be cooperatively engineered in an uncoordinated manner by bots as they compete to spread their stances. Our model, which has been validated against real-world data, is an advancement over previous models because we account for engagement metrics in influencing stance adoption, non-social tie spreading of information, neutrality as a stance that can be spread, and themes that are analogous to media's framing effect and are symbiotic with respect to stance propagation. The strengths of the Diluvsion model are demonstrated in simulations of orthodox info ops, e.g., maximizing adoption of one stance; creating echo chambers; inducing polarization; and unorthodox info ops, e.g., simultaneous support of multiple stances as a Trojan horse tactic for the dissemination of a theme.

Submitted 6 February, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: 60 pages, 9 figures, 1 table

arXiv:2310.10461 [pdf, other]

Model Selection of Zero-shot Anomaly Detectors in the Absence of Labeled Validation Data

Authors: Clement Fung, Chen Qiu, Aodong Li, Maja Rudolph

Abstract: Anomaly detection requires detecting abnormal samples in large unlabeled datasets. While progress in deep learning and the advent of foundation models has produced powerful zero-shot anomaly detection methods, their deployment in practice is often hindered by the lack of labeled data -- without it, their detection performance cannot be evaluated reliably. In this work, we propose SWSA (Selection With Synthetic Anomalies): a general-purpose framework to select image-based anomaly detectors with a generated synthetic validation set. Our proposed anomaly generation method assumes access to only a small support set of normal images and requires no training or fine-tuning. Once generated, our synthetic validation set is used to create detection tasks that compose a validation framework for model selection. In an empirical study, we find that SWSA often selects models that match selections made with a ground-truth validation set, resulting in higher AUROCs than baseline methods. We also find that SWSA selects prompts for CLIP-based anomaly detection that outperform baseline prompt selection strategies on all datasets, including the challenging MVTec-AD and VisA datasets.

Submitted 9 February, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: 14 pages

arXiv:2309.01189 [pdf, other]

LogGPT: Exploring ChatGPT for Log-Based Anomaly Detection

Authors: Jiaxing Qi, Shaohan Huang, Zhongzhi Luan, Carol Fung, Hailong Yang, Depei Qian

Abstract: The increasing volume of log data produced by software-intensive systems makes it impractical to analyze them manually. Many deep learning-based methods have been proposed for log-based anomaly detection. These methods face several challenges such as high-dimensional and noisy log data, class imbalance, generalization, and model interpretability. Recently, ChatGPT has shown promising results in various domains. However, there is still a lack of study on the application of ChatGPT for log-based anomaly detection. In this work, we proposed LogGPT, a log-based anomaly detection framework based on ChatGPT. By leveraging the ChatGPT's language interpretation capabilities, LogGPT aims to explore the transferability of knowledge from large-scale corpora to log-based anomaly detection. We conduct experiments to evaluate the performance of LogGPT and compare it with three deep learning-based methods on BGL and Spirit datasets. LogGPT shows promising results and has good interpretability. This study provides preliminary insights into prompt-based models, such as ChatGPT, for the log-based anomaly detection task.

Submitted 3 September, 2023; originally announced September 2023.

arXiv:2307.10631 [pdf, other]

Pluvio: Assembly Clone Search for Out-of-domain Architectures and Libraries through Transfer Learning and Conditional Variational Information Bottleneck

Authors: Zhiwei Fu, Steven H. H. Ding, Furkan Alaca, Benjamin C. M. Fung, Philippe Charland

Abstract: The practice of code reuse is crucial in software development for a faster and more efficient development lifecycle. In reality, however, code reuse practices lack proper control, resulting in issues such as vulnerability propagation and intellectual property infringements. Assembly clone search, a critical shift-right defence mechanism, has been effective in identifying vulnerable code resulting from reuse in released executables. Recent studies on assembly clone search demonstrate a trend towards using machine learning-based methods to match assembly code variants produced by different toolchains. However, these methods are limited to what they learn from a small number of toolchain variants used in training, rendering them inapplicable to unseen architectures and their corresponding compilation toolchain variants. This paper presents the first study on the problem of assembly clone search with unseen architectures and libraries. We propose incorporating human common knowledge through large-scale pre-trained natural language models, in the form of transfer learning, into current learning-based approaches for assembly clone search. Transfer learning can aid in addressing the limitations of the existing approaches, as it can bring in broader knowledge from human experts in assembly code. We further address the sequence limit issue by proposing a reinforcement learning agent to remove unnecessary and redundant tokens. Coupled with a new Variational Information Bottleneck learning strategy, the proposed system minimizes the reliance on potential indicators of architectures and optimization settings, for a better generalization of unseen architectures. We simulate the unseen architecture clone search scenarios and the experimental results show the effectiveness of the proposed approach against the state-of-the-art solutions.

Submitted 20 July, 2023; originally announced July 2023.

Comments: 13 pages and 4 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2303.11715 [pdf, other]

LogQA: Question Answering in Unstructured Logs

Authors: Shaohan Huang, Yi Liu, Carol Fung, Jiaxing Qi, Hailong Yang, Zhongzhi Luan

Abstract: Modern systems produce a large volume of logs to record run-time status and events. System operators use these raw logs to track a system in order to obtain some useful information to diagnose system anomalies. One of the most important problems in this area is to help operators find the answers to log-based questions efficiently and in a user-friendly way. In this work, we propose LogQA, which aims at answering log-based questions in the form of natural language based on large-scale unstructured log corpora. Our system presents the answer to a question directly instead of returning a list of relevant snippets, thus offering better user-friendliness and efficiency. LogQA represents the first approach to solve question answering in the log domain. LogQA has two key components: Log Retriever and Log Reader. Log Retriever aims at retrieving relevant logs w.r.t. a given question, while Log Reader is responsible for inferring the final answer. Given the lack of a public dataset for log question answering, we manually labelled a QA dataset of three open-source log corpora and will make them publicly available. We evaluated our proposed model on these datasets by comparing its performance with 6 other baseline methods. Our experimental results demonstrate that LogQA has outperformed other baseline methods.

Submitted 21 March, 2023; originally announced March 2023.

arXiv:2210.11711 [pdf, ps, other]

Modelling Multi-relations for Convolutional-based Knowledge Graph Embedding

Authors: Sirui Li, Kok Wai Wong, Dengya Zhu, Chun Che Fung

Abstract: Representation learning of knowledge graphs aims to embed entities and relations into low-dimensional vectors. Most existing works only consider the direct relations or paths between an entity pair. It is considered that such approaches disconnect the semantic connection of multi-relations between an entity pair, and we propose a convolutional and multi-relational representation learning model, ConvMR. The proposed ConvMR model addresses the multi-relation issue in two aspects: (1) Encoding the multi-relations between an entity pair into a unified vector that maintains the semantic connection. (2) Since not all relations are necessary while joining multi-relations, we propose an attention-based relation encoder to automatically assign weights to different relations based on semantic hierarchy. Experimental results on two popular datasets, FB15k-237 and WN18RR, achieved consistent improvements on the mean rank. We also found that ConvMR is efficient to deal with less frequent entities.

Submitted 20 October, 2022; originally announced October 2022.

Comments: Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 26th International Conference KES2022

arXiv:2209.15207 [pdf, other]

Mixture of experts models for multilevel data: modelling framework and approximation theory

Authors: Tsz Chai Fung, Spark C. Tseung

Abstract: Multilevel data are prevalent in many real-world applications. However, it remains an open research problem to identify and justify a class of models that flexibly capture a wide range of multilevel data. Motivated by the versatility of the mixture of experts (MoE) models in fitting regression data, in this article we extend upon the MoE and study a class of mixed MoE (MMoE) models for multilevel data. Under some regularity conditions, we prove that the MMoE is dense in the space of any continuous mixed effects models in the sense of weak convergence. As a result, the MMoE has a potential to accurately resemble almost all characteristics inherited in multilevel data, including the marginal distributions, dependence structures, regression links, random intercepts and random slopes. In a particular case where the multilevel data is hierarchical, we further show that a nested version of the MMoE universally approximates a broad range of dependence structures of the random effects among different factor levels.

Submitted 29 September, 2022; originally announced September 2022.

arXiv:2203.13570 [pdf, other]

Improving Question Answering over Knowledge Graphs Using Graph Summarization

Authors: Sirui Li, Kok Wai Wong, Dengya Zhu, Chun Che Fung

Abstract: Question Answering (QA) systems over Knowledge Graphs (KGs) (KGQA) automatically answer natural language questions using triples contained in a KG. The key idea is to represent questions and entities of a KG as low-dimensional embeddings. Previous KGQAs have attempted to represent entities using Knowledge Graph Embedding (KGE) and Deep Learning (DL) methods. However, KGEs are too shallow to capture the expressive features and DL methods process each triple independently. Recently, Graph Convolutional Network (GCN) has shown to be excellent in providing entity embeddings. However, using GCNs to KGQAs is inefficient because GCNs treat all relations equally when aggregating neighbourhoods. Also, a problem could occur when using previous KGQAs: in most cases, questions often have an uncertain number of answers. To address the above issues, we propose a graph summarization technique using Recurrent Convolutional Neural Network (RCNN) and GCN. The combination of GCN and RCNN ensures that the embeddings are propagated together with the relations relevant to the question, and thus better answers. The proposed graph summarization technique can be used to tackle the issue that KGQAs cannot answer questions with an uncertain number of answers. In this paper, we demonstrated the proposed technique on the most common type of questions, which is single-relation questions. Experiments have demonstrated that the proposed graph summarization technique using RCNN and GCN can provide better results when compared to the GCN. The proposed graph summarization technique significantly improves the recall of actual answers when the questions have an uncertain number of answers.

Submitted 25 March, 2022; originally announced March 2022.

Comments: The paper is accepted by ICONIP 2021

arXiv:2111.08842 [pdf, other]

doi 10.1109/MC.2021.3125611

Privacy Guarantees of BLE Contact Tracing: A Case Study on COVIDWISE

Authors: Salman Ahmed, Ya Xiao, Taejoong Chung, Carol Fung, Moti Yung, Danfeng Yao

Abstract: Google and Apple jointly introduced a digital contact tracing technology and an API called "exposure notification," to help health organizations and governments with contact tracing. The technology and its interplay with security and privacy constraints require investigation. In this study, we examine and analyze the security, privacy, and reliability of the technology with actual and typical scenarios (and expected typical adversary in mind), and quite realistic use cases. We do it in the context of Virginia's COVIDWISE app. This experimental analysis validates the properties of the system under the above conditions, a result that seems crucial for the peace of mind of the exposure notification technology adopting authorities, and may also help with the system's transparency and overall user trust.

Submitted 16 December, 2021; v1 submitted 16 November, 2021; originally announced November 2021.

Comments: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Journal ref: IEEE Computer 2021

arXiv:2111.02303 [pdf, other]

On the Effectiveness of Interpretable Feedforward Neural Network

Authors: Miles Q. Li, Benjamin C. M. Fung, Adel Abusitta

Abstract: Deep learning models have achieved state-of-the-art performance in many classification tasks. However, most of them cannot provide an interpretation for their classification results. Machine learning models that are interpretable are usually linear or piecewise linear and yield inferior performance. Non-linear models achieve much better classification performance, but it is hard to interpret their classification results. This may have been changed by an interpretable feedforward neural network (IFFNN) proposed that achieves both high classification performance and interpretability for malware detection. If the IFFNN can perform well in a more flexible and general form for other classification tasks while providing meaningful interpretations, it may be of great interest to the applied machine learning community. In this paper, we propose a way to generalize the interpretable feedforward neural network to multi-class classification scenarios and any type of feedforward neural networks, and evaluate its classification performance and interpretability on intrinsic interpretable datasets. We conclude by finding that the generalized IFFNNs achieve comparable classification performance to their normal feedforward neural network counterparts and provide meaningful interpretations. Thus, this kind of neural network architecture has great practical use.

Submitted 3 November, 2021; originally announced November 2021.

arXiv:2110.08254 [pdf, other]

Inconsistent Few-Shot Relation Classification via Cross-Attentional Prototype Networks with Contrastive Learning

Authors: Hongru Wang, Zhijing Jin, Jiarun Cao, Gabriel Pui Cheong Fung, Kam-Fai Wong

Abstract: Standard few-shot relation classification (RC) is designed to learn a robust classifier with only few labeled data for each class. However, previous works rarely investigate the effects of a different number of classes (i.e., $N$-way) and number of labeled data per class (i.e., $K$-shot) during training vs. testing. In this work, we define a new task, \textit{inconsistent few-shot RC}, where the model needs to handle the inconsistency of $N$ and $K$ between training and testing. To address this new task, we propose Prototype Network-based cross-attention contrastive learning (ProtoCACL) to capture the rich mutual interactions between the support set and query set. Experimental results demonstrate that our ProtoCACL can outperform the state-of-the-art baseline model under both inconsistent $K$ and inconsistent $N$ settings, owing to its more robust and discriminate representations. Moreover, we identify that in the inconsistent few-shot learning setting, models can achieve better performance with \textit{less data} than the standard few-shot setting with carefully-selected $N$ and $K$. In the end of the paper, we provide further analyses and suggestions to systematically guide the selection of $N$ and $K$ under different scenarios.

Submitted 13 October, 2021; originally announced October 2021.

arXiv:2109.05234 [pdf, other]

Prior Omission of Dissimilar Source Domain(s) for Cost-Effective Few-Shot Learning

Authors: Zezhong Wang, Hongru Wang, Kwan Wai Chung, Jia Zhu, Gabriel Pui Cheong Fung, Kam-Fai Wong

Abstract: Few-shot slot tagging is an emerging research topic in the field of Natural Language Understanding (NLU). With sufficient annotated data from source domains, the key challenge is how to train and adapt the model to another target domain which only has few labels. Conventional few-shot approaches use all the data from the source domains without considering inter-domain relations and implicitly assume each sample in the domain contributes equally. However, our experiments show that the data distribution bias among different domains will significantly affect the adaption performance. Moreover, transferring knowledge from dissimilar domains will even introduce some extra noises so that affect the performance of models. To tackle this problem, we propose an effective similarity-based method to select data from the source domains. In addition, we propose a Shared-Private Network (SP-Net) for the few-shot slot tagging task. The words from the same class would have some shared features. We extract those shared features from the limited annotated data on the target domain and merge them together as the label embedding to help us predict other unlabelled data on the target domain. The experiment shows that our method outperforms the state-of-the-art approaches with fewer source data. The result also proves that some training data from dissimilar sources are redundant and even negative for the adaption.

Submitted 11 September, 2021; originally announced September 2021.

arXiv:2109.05187 [pdf, other]

TopicRefine: Joint Topic Prediction and Dialogue Response Generation for Multi-turn End-to-End Dialogue System

Authors: Hongru Wang, Mingyu Cui, Zimo Zhou, Gabriel Pui Cheong Fung, Kam-Fai Wong

Abstract: A multi-turn dialogue always follows a specific topic thread, and topic shift at the discourse level occurs naturally as the conversation progresses, necessitating the model's ability to capture different topics and generate topic-aware responses. Previous research has either predicted the topic first and then generated the relevant response, or simply applied the attention mechanism to all topics, ignoring the joint distribution of the topic prediction and response generation models and resulting in uncontrollable and unrelated responses. In this paper, we propose a joint framework with a topic refinement mechanism to learn these two tasks simultaneously. Specifically, we design a three-pass iteration mechanism to generate coarse response first, then predict corresponding topics, and finally generate refined response conditioned on predicted topics. Moreover, we utilize GPT2DoubleHeads and BERT for the topic prediction task respectively, aiming to investigate the effects of joint learning and the understanding ability of GPT model. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance at response generation task and the great potential understanding capability of GPT model.

Submitted 11 September, 2021; originally announced September 2021.

arXiv:2104.08530 [pdf, other]

The Topic Confusion Task: A Novel Scenario for Authorship Attribution

Authors: Malik H. Altakrori, Jackie Chi Kit Cheung, Benjamin C. M. Fung

Abstract: Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether new, unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by a failure to capture authorship writing style or by a topic shift. Motivated by this, we propose the \emph{topic confusion} task where we switch the author-topic configuration between the training and testing sets. This setup allows us to distinguish two types of errors: those caused by the topic shift and those caused by the features' inability to capture the writing styles. We show that stylometric features with part-of-speech tags are the least susceptible to topic variations. We further show that combining them with other features leads to significantly lower topic confusion and higher attribution accuracy. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task and are surpassed by simple features such as word-level $n$-grams.

Submitted 9 September, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

Comments: 15 pages (9 + ref./appin.), 6 figures, Accepted to Findings of EMNLP 2021

arXiv:2011.08772 [pdf, other]

KddRES: A Multi-level Knowledge-driven Dialogue Dataset for Restaurant Towards Customized Dialogue System

Authors: Hongru Wang, Min Li, Zimo Zhou, Gabriel Pui Cheong Fung, Kam-Fai Wong

Abstract: Compared with the CrossWOZ (Chinese) and MultiWOZ (English) datasets, which have coarse-grained information, there is no dataset which handles fine-grained and hierarchical level information properly. In this paper, we publish the first Cantonese knowledge-driven Dialogue Dataset for REStaurant (KddRES) in Hong Kong, which grounds the information in multi-turn conversations to one specific restaurant. Our corpus contains 0.8k conversations which derive from 10 restaurants with various styles in different regions. In addition to that, we designed fine-grained slots and intents to better capture semantic information. The benchmark experiments and data statistic analysis show the diversity and rich annotations of our dataset. We believe the publication of KddRES can be a necessary supplement to current dialogue datasets and more suitable and valuable for small and middle enterprises (SMEs) of society, such as building a customized dialogue system for each restaurant. The corpus and benchmark models are publicly available.

Submitted 14 December, 2021; v1 submitted 17 November, 2020; originally announced November 2020.

Comments: 8 pages, 2 figures

arXiv:2011.06301 [pdf, other]

doi 10.1109/TKDE.2020.3038211

Learning Inter-Modal Correspondence and Phenotypes from Multi-Modal Electronic Health Records

Authors: Kejing Yin, William K. Cheung, Benjamin C. M. Fung, Jonathan Poon

Abstract: Non-negative tensor factorization has been shown to be a practical solution to automatically discover phenotypes from the electronic health records (EHR) with minimal human supervision. Such methods generally require an input tensor describing the inter-modal interactions to be pre-established; however, the correspondence between different modalities (e.g., correspondence between medications and diagnoses) can often be missing in practice. Although heuristic methods can be applied to estimate them, they inevitably introduce errors and lead to sub-optimal phenotype quality. This is particularly important for patients with complex health conditions (e.g., in critical care) as multiple diagnoses and medications are simultaneously present in the records. To alleviate this problem and discover phenotypes from EHR with unobserved inter-modal correspondence, we propose the collective hidden interaction tensor factorization (cHITF) to infer the correspondence between multiple modalities jointly with the phenotype discovery. We assume that the observed matrix for each modality is a marginalization of the unobserved inter-modal correspondence, which is reconstructed by maximizing the likelihood of the observed matrices. Extensive experiments conducted on the real-world MIMIC-III dataset demonstrate that cHITF effectively infers clinically meaningful inter-modal correspondence, discovers phenotypes that are more clinically relevant and diverse, and achieves better predictive performance compared with a number of state-of-the-art computational phenotyping models.

Submitted 12 November, 2020; originally announced November 2020.

Comments: Accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE)

arXiv:2006.09161 [pdf, other]

CUHK at SemEval-2020 Task 4: CommonSense Explanation, Reasoning and Prediction with Multi-task Learning

Authors: Hongru Wang, Xiangru Tang, Sunny Lai, Kwong Sak Leung, Jia Zhu, Gabriel Pui Cheong Fung, Kam-Fai Wong

Abstract: This paper describes our system submitted to task 4 of SemEval 2020: Commonsense Validation and Explanation (ComVE), which consists of three sub-tasks. The task is to directly validate whether or not the given sentence makes sense and to require the model to explain it. Based on the BERT architecture with a multi-task setting, we propose an effective and interpretable "Explain, Reason and Predict" (ERP) system to solve the three sub-tasks about commonsense: (a) Validation, (b) Reasoning, and (c) Explanation. Inspired by cognitive studies of common sense, our system first generates a reason or understanding of the sentences and then chooses which one statement makes sense, which is achieved by multi-task learning. During the post-evaluation, our system has reached 92.9% accuracy in subtask A (rank 11), 89.7% accuracy in subtask B (rank 9), and a BLEU score of 12.9 in subtask C (rank 8).

Submitted 27 July, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

arXiv:2006.06862

Deep Learning-based Stress Determinator for Mouse Psychiatric Analysis using Hippocampus Activity

Authors: Donghan Liu, Benjamin C. M. Fung, Tak Pan Wong

Abstract: Decoding neurons to extract information from transmission and employ it for other uses is the goal of neuroscientists' study. Because the field of neuroscience presently relies on traditional methods, we combine state-of-the-art deep learning techniques with the theory of neuron decoding to discuss its potential. In addition, the stress level that is related to neuron activity in the hippocampus is statistically examined as well. The experiments suggest that our state-of-the-art deep learning-based stress determinator provides good performance with respect to its model prediction accuracy and, additionally, that there is strong evidence against equivalence of mouse stress level under diverse environments.

Submitted 27 June, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: The paper needs to be re-evaluated and reviewed, which may cause some significant changes

arXiv:1909.06865 [pdf, other]

doi 10.1016/j.cose.2021.102371

I-MAD: Interpretable Malware Detector Using Galaxy Transformer

Authors: Miles Q. Li, Benjamin C. M. Fung, Philippe Charland, Steven H. H. Ding

Abstract: Malware currently presents a number of serious threats to computer users. Signature-based malware detection methods are limited in detecting new malware samples that are significantly different from known ones. Therefore, machine learning-based methods have been proposed, but there are two challenges these methods face. The first is to model the full semantics behind the assembly code of malware. The second challenge is to provide interpretable results while keeping excellent detection performance. In this paper, we propose an Interpretable MAlware Detector (I-MAD) that outperforms state-of-the-art static malware detection models regarding accuracy with excellent interpretability. To improve the detection performance, I-MAD incorporates a novel network component called the Galaxy Transformer network that can understand assembly code at the basic block, function, and executable levels. It also incorporates our proposed interpretable feed-forward neural network to provide interpretations for its detection results by quantifying the impact of each feature with respect to the prediction. Experiment results show that our model significantly outperforms existing state-of-the-art static malware detection models and presents meaningful interpretations.

Submitted 20 June, 2021; v1 submitted 15 September, 2019; originally announced September 2019.

Comments: Published by Elsevier Computers & Security

arXiv:1907.08736 [pdf, other]

ER-AE: Differentially Private Text Generation for Authorship Anonymization

Authors: Haohan Bo, Steven H. H. Ding, Benjamin C. M. Fung, Farkhund Iqbal

Abstract: Most privacy protection studies for textual data focus on removing explicit sensitive identifiers. However, personal writing style, as a strong indicator of the authorship, is often neglected. Recent studies, such as SynTF, have shown promising results on privacy-preserving text mining. However, their anonymization algorithm can only output numeric term vectors which are difficult for the recipients to interpret. We propose a novel text generation model with a two-set exponential mechanism for authorship anonymization. By augmenting the semantic information through a REINFORCE training reward function, the model can generate differentially private text that has a close semantic and similar grammatical structure to the original text while removing personal traits of the writing style. It does not assume any conditioned labels or paralleled text data for training. We evaluate the performance of the proposed model on the real-life peer reviews dataset and the Yelp review dataset. The result suggests that our model outperforms the state-of-the-art on semantic preservation, authorship obfuscation, and stylometric transformation.

Submitted 13 May, 2021; v1 submitted 19 July, 2019; originally announced July 2019.

arXiv:1811.09904 [pdf, other]

Biscotti: A Ledger for Private and Secure Peer-to-Peer Machine Learning

Authors: Muhammad Shayan, Clement Fung, Chris J. M. Yoon, Ivan Beschastnikh

Abstract: Federated Learning is the current state of the art in supporting secure multi-party machine learning (ML): data is maintained on the owner's device and the updates to the model are aggregated through a secure protocol. However, this process assumes a trusted centralized infrastructure for coordination, and clients must trust that the central service does not use the byproducts of client data. In addition to this, a group of malicious clients could also harm the performance of the model by carrying out a poisoning attack. As a response, we propose Biscotti: a fully decentralized peer to peer (P2P) approach to multi-party ML, which uses blockchain and cryptographic primitives to coordinate a privacy-preserving ML process between peering clients. Our evaluation demonstrates that Biscotti is scalable, fault tolerant, and defends against known attacks. For example, Biscotti is able to protect the privacy of an individual client's update and the performance of the global model at scale when 30% of adversaries are trying to poison the model. The implementation can be found at: https://github.com/DistributedML/Biscotti

Submitted 11 December, 2019; v1 submitted 24 November, 2018; originally announced November 2018.

Comments: 20 pages

arXiv:1811.09712 [pdf, other]

Dancing in the Dark: Private Multi-Party Machine Learning in an Untrusted Setting

Authors: Clement Fung, Jamie Koerner, Stewart Grant, Ivan Beschastnikh

Abstract: Distributed machine learning (ML) systems today use an unsophisticated threat model: data sources must trust a central ML process. We propose a brokered learning abstraction that allows data sources to contribute towards a globally-shared model with provable privacy guarantees in an untrusted setting. We realize this abstraction by building on federated learning, the state of the art in multi-party ML, to construct TorMentor: an anonymous hidden service that supports private multi-party ML. We define a new threat model by characterizing, developing and evaluating new attacks in the brokered learning setting, along with new defenses for these attacks. We show that TorMentor effectively protects data providers against known ML attacks while providing them with a tunable trade-off between model accuracy and privacy. We evaluate TorMentor with local and geo-distributed deployments on Azure/Tor. In an experiment with 200 clients and 14 MB of data per client, our prototype trained a logistic regression model using stochastic gradient descent in 65s. Code is available at: https://github.com/DistributedML/TorML

Submitted 23 February, 2019; v1 submitted 23 November, 2018; originally announced November 2018.

Comments: 16 pages

arXiv:1808.05283 [pdf, other]

doi 10.1016/j.sysarc.2019.02.009

All One Needs to Know about Fog Computing and Related Edge Computing Paradigms: A Complete Survey

Authors: Ashkan Yousefpour, Caleb Fung, Tam Nguyen, Krishna Kadiyala, Fatemeh Jalali, Amirreza Niakanlahiji, Jian Kong, Jason P. Jue

Abstract: With the Internet of Things (IoT) becoming part of our daily life and our environment, we expect rapid growth in the number of connected devices. IoT is expected to connect billions of devices and humans to bring promising advantages for us. With this growth, fog computing, along with its related edge computing paradigms, such as multi-access edge computing (MEC) and cloudlet, are seen as promising solutions for handling the large volume of security-critical and time-sensitive data that is being produced by the IoT. In this paper, we first provide a tutorial on fog computing and its related computing paradigms, including their similarities and differences. Next, we provide a taxonomy of research topics in fog computing, and through a comprehensive survey, we summarize and categorize the efforts on fog computing and its related computing paradigms. Finally, we provide challenges and future directions for research in fog computing.

Submitted 13 February, 2019; v1 submitted 15 August, 2018; originally announced August 2018.

Comments: 48 pages, 7 tables, 11 figures, 450 references. The data (categories and features/objectives of the papers) of this survey are now available publicly. Accepted by Elsevier Journal of Systems Architecture

arXiv:1808.04866 [pdf, other]

Mitigating Sybils in Federated Learning Poisoning

Authors: Clement Fung, Chris J. M. Yoon, Ivan Beschastnikh

Abstract: Machine learning (ML) over distributed multi-party data is required for a variety of domains. Existing approaches, such as federated learning, collect the outputs computed by a group of devices at a central aggregator and run iterative algorithms to train a globally shared model. Unfortunately, such approaches are susceptible to a variety of attacks, including model poisoning, which is made substantially worse in the presence of sybils. In this paper we first evaluate the vulnerability of federated learning to sybil-based poisoning attacks. We then describe FoolsGold, a novel defense to this problem that identifies poisoning sybils based on the diversity of client updates in the distributed learning process. Unlike prior work, our system does not bound the expected number of attackers, requires no auxiliary information outside of the learning process, and makes fewer assumptions about clients and their data. In our evaluation we show that FoolsGold exceeds the capabilities of existing state of the art approaches to countering sybil-based label-flipping and backdoor poisoning attacks. Our results hold for different distributions of client data, varying poisoning targets, and various sybil strategies. Code can be found at: https://github.com/DistributedML/FoolsGold

Submitted 15 July, 2020; v1 submitted 14 August, 2018; originally announced August 2018.

Comments: 16 pages, Extended technical version of conference paper "The Limitations of Federated Learning in Sybil Settings" accepted at RAID 2020

arXiv:1711.06710 [pdf, other]

Instant Accident Reporting and Crowdsensed Road Condition Analytics for Smart Cities

Authors: Ashkan Yousefpour, Caleb Fung, Tam Nguyen, David Hong, Daniel Zhang

Abstract: The following report contains information about a proposed technology by the authors, which consists of a device that sits inside of a vehicle and constantly monitors the car information. It can determine speed, g-force, and location coordinates. Using these data, the device can detect a car crash or pothole on the road. The data collected from the car is forwarded to a server for more in-depth analytics. If there is an accident, the server promptly contacts the emergency services with the location of the crash. Moreover, the pothole information is used for analytics of road conditions.

Submitted 17 November, 2017; originally announced November 2017.

Comments: 8 pages, 7 figures, submitted to "Communication Technology Changing the World Competition", Sponsored by IEEE Communication Society

arXiv:1606.01219 [pdf, other]

Learning Stylometric Representations for Authorship Analysis

Authors: Steven H. H. Ding, Benjamin C. M. Fung, Farkhund Iqbal, William K. Cheung

Abstract: Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a body of exponentially exploding textual data. It extracts an author's identity and sociolinguistic characteristics based on the reflected writing styles in the text. It is an essential process for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for authorship analysis. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization and authorship verification with the Twitter, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the bag-of-lexical-n-grams, Latent Dirichlet Allocation, Latent Semantic Analysis, PVDM, PVDBOW, and word2vec representations.

Submitted 3 June, 2016; originally announced June 2016.

ACM Class: K.4.1; I.7.5; I.2.7

arXiv:1405.0198

No Superluminal Signaling Implies Unconditionally Secure Bit Commitment

Authors: H. F. Chau, C. -H. Fred Fung, H. -K. Lo

Abstract: Bit commitment (BC) is an important cryptographic primitive for an agent to convince a mutually mistrustful party that she has already made a binding choice of 0 or 1 but only to reveal her choice at a later time. Ideally, a BC protocol should be simple, reliable, easy to implement using existing technologies, and most importantly unconditionally secure in the sense that its security is based on an information-theoretic proof rather than computational complexity assumption or the existence of a trustworthy arbitrator. Here we report such a provably secure scheme involving only one-way classical communications whose unconditional security is based on no superluminal signaling (NSS). Our scheme is inspired by the earlier works by Kent, who proposed two impractical relativistic protocols whose unconditional securities are yet to be established as well as several provably unconditionally secure protocols which rely on both quantum mechanics and NSS. Our scheme is conceptually simple and shows for the first time that quantum communication is not needed to achieve unconditional security for BC. Moreover, with purely classical communications, our scheme is practical and easy to implement with existing telecom technologies. This completes the cycle of study of unconditionally secure bit commitment based on known physical laws.

Submitted 18 November, 2014; v1 submitted 1 May, 2014; originally announced May 2014.

Comments: This paper has been withdrawn by the authors due to a crucial oversight on an earlier work by A. Kent

arXiv:1208.2773 [pdf, other]

Privacy Preserving Record Linkage via grams Projections

Authors: Luca Bonomi, Li Xiong, Rui Chen, Benjamin C. M. Fung

Abstract: Record linkage has been extensively used in various data mining applications involving sharing data. While the amount of available data is growing, the concern of disclosing sensitive information poses the problem of utility vs privacy. In this paper, we study the problem of private record linkage via secure data transformations. In contrast to the existing techniques in this area, we propose a novel approach that provides strong privacy guarantees under the formal framework of differential privacy. We develop an embedding strategy based on frequent variable length grams mined in a private way from the original data. We also introduce personalized threshold for matching individual records in the embedded space which achieves better linkage accuracy than the existing global threshold approach. Compared with the state-of-the-art secure matching schema, our approach provides formal, provable privacy guarantees and achieves better scalability while providing comparable utility.

Submitted 13 August, 2012; originally announced August 2012.

arXiv:1112.2020 [pdf, ps, other]

Differentially Private Trajectory Data Publication

Authors: Rui Chen, Benjamin C. M. Fung, Bipin C. Desai

Abstract: With the increasing prevalence of location-aware devices, trajectory data has been generated and collected in various application domains. Trajectory data carries rich information that is useful for many data analysis tasks. Yet, improper publishing and use of trajectory data could jeopardize individual privacy. However, it has been shown that existing privacy-preserving trajectory data publishing methods derived from partition-based privacy models, for example k-anonymity, are unable to provide sufficient privacy protection. In this paper, motivated by the data publishing scenario at the Societe de transport de Montreal (STM), the public transit agency in Montreal area, we study the problem of publishing trajectory data under the rigorous differential privacy model. We propose an efficient data-dependent yet differentially private sanitization algorithm, which is applicable to different types of trajectory data. The efficiency of our approach comes from adaptively narrowing down the output domain by building a noisy prefix tree based on the underlying data. Moreover, as a post-processing step, we make use of the inherent constraints of a prefix tree to conduct constrained inferences, which lead to better utility. This is the first paper to introduce a practical solution for publishing large volume of trajectory data under differential privacy. We examine the utility of sanitized data in terms of count queries and frequent sequential pattern mining. Extensive experiments on real-life trajectory data from the STM demonstrate that our approach maintains high utility and is scalable to large trajectory datasets.

Submitted 9 December, 2011; originally announced December 2011.

arXiv:1002.3190 [pdf, ps, other]

A Distributed Sequential Algorithm for Collaborative Intrusion Detection Networks

Authors: Quanyan Zhu, Carol J. Fung, Raouf Boutaba, Tamer Basar

Abstract: Collaborative intrusion detection networks are often used to gain better detection accuracy and cost efficiency as compared to a single host-based intrusion detection system (IDS). Through cooperation, it is possible for a local IDS to detect new attacks that may be known to other experienced acquaintances. In this paper, we present a sequential hypothesis testing method for feedback aggregation for each individual IDS in the network. Our simulation results corroborate our theoretical results and demonstrate the properties of cost efficiency and accuracy compared to other heuristic methods. The analytical result on the lower-bound of the average number of acquaintances for consultation is essential for the design and configuration of IDSs in a collaborative environment.

Submitted 16 February, 2010; originally announced February 2010.

arXiv:quant-ph/0601115 [pdf, ps, other]

doi 10.1103/PhysRevA.75.032314

Phase-Remapping Attack in Practical Quantum Key Distribution Systems

Authors: Chi-Hang Fred Fung, Bing Qi, Kiyoshi Tamaki, Hoi-Kwong Lo

Abstract: Quantum key distribution (QKD) can be used to generate secret keys between two distant parties. Even though QKD has been proven unconditionally secure against eavesdroppers with unlimited computation power, practical implementations of QKD may contain loopholes that may lead to the generated secret keys being compromised. In this paper, we propose a phase-remapping attack targeting two practical bidirectional QKD systems (the "plug & play" system and the Sagnac system). We showed that if the users of the systems are unaware of our attack, the final key shared between them can be compromised in some situations. Specifically, we showed that, in the case of the Bennett-Brassard 1984 (BB84) protocol with ideal single-photon sources, when the quantum bit error rate (QBER) is between 14.6% and 20%, our attack renders the final key insecure, whereas the same range of QBER values has been proved secure if the two users are unaware of our attack; also, we demonstrated three situations with realistic devices where positive key rates are obtained without the consideration of Trojan horse attacks but in fact no key can be distilled. We remark that our attack is feasible with only current technology. Therefore, it is very important to be aware of our attack in order to ensure absolute security. In finding our attack, we minimize the QBER over individual measurements described by a general POVM, which has some similarity with the standard quantum state discrimination problem.

Submitted 5 March, 2007; v1 submitted 17 January, 2006; originally announced January 2006.

Comments: 13 pages, 8 figures

Journal ref: Phys. Rev. A 75, 032314 (2007)

Showing 1–33 of 33 results for author: Fung, C