-
Enhancing Retrieval Processes for Language Generation with Augmented Queries
Authors:
Julien Pierre Edmond Ghali,
Kosuke Shima,
Koichi Moriyama,
Atsuko Mutoh,
Nobuhiro Inuzuka
Abstract:
In the rapidly changing world of smart technology, searching for documents has become more challenging due to the rise of advanced language models. These models sometimes face difficulties, like providing inaccurate information, commonly known as "hallucination." This research focuses on addressing this issue through Retrieval-Augmented Generation (RAG), a technique that guides models to give accurate responses based on real facts. To overcome scalability issues, the study explores connecting user queries with sophisticated language models such as BERT and Orca2, using an innovative query optimization process. The study unfolds in three scenarios: first, without RAG; second, without additional assistance; and finally, with extra help. Choosing the compact yet efficient Orca2 7B model demonstrates a smart use of computing resources. The empirical results indicate a significant improvement in the initial language model's performance under RAG, particularly when assisted with prompt augmenters. Consistency in document retrieval across different encodings highlights the effectiveness of using language model-generated queries. The introduction of UMAP for BERT further simplifies document retrieval while maintaining strong results.
Submitted 6 February, 2024;
originally announced February 2024.
-
WebRTC-based measurement tool for peer-to-peer applications and preliminary findings with real users
Authors:
Kosuke Nakagawa,
Manabu Tsukada,
Keiichi Shima,
Hiroshi Esaki
Abstract:
Direct peer-to-peer (P2P) communication is often used to minimize the end-to-end latency for real-time applications that require accurate synchronization, such as remote musical ensembles. However, there are few studies on the performance of P2P communication between home network environments, thus hindering the deployment of services that require synchronization. In this study, we developed a P2P performance measurement tool using the Web Real-Time Communication (WebRTC) statistics application programming interface. Using this tool, we can easily measure P2P performance between home network environments on a web browser without downloading client applications. We also verified the reliability of round-trip time (RTT) measurements using WebRTC and confirmed that our system could provide the necessary measurement accuracy for RTT and jitter measurements for real-time applications. In addition, we measured the performance of a full mesh topology connection with 10 users in an actual environment in Japan. Consequently, we found that only 66% of the peer connections had a latency of 30 ms or less, which is the minimum requirement for high synchronization applications, such as musical ensembles.
Submitted 3 December, 2021;
originally announced December 2021.
-
Classification of URL bitstreams using Bag of Bytes
Authors:
Keiichi Shima,
Daisuke Miyamoto,
Hiroshi Abe,
Tomohiro Ishihara,
Kazuya Okada,
Yuji Sekiya,
Hirochika Asai,
Yusuke Doi
Abstract:
Protecting users from accessing malicious web sites is one of the important management tasks for network operators. There are many open-source and commercial products to control which web sites users can access. The most traditional approach is blacklist-based filtering. This mechanism is simple but not scalable, though there are some enhanced approaches utilizing fuzzy matching technologies. Other approaches try to use machine learning (ML) techniques by extracting features from URL strings. This approach can cover a wider area of Internet web sites, but finding good features requires deep knowledge of trends in web site design. Recently, another approach using deep learning (DL) has appeared. The DL approach helps to extract features automatically by investigating a large amount of existing sample data. Using this technique, we can build a flexible filtering decision module by continually teaching the neural network module about recent trends, without any specific expert knowledge of the URL domain. In this paper, we apply a mechanical approach to generate feature vectors from URL strings. We implemented our approach and tested it with realistic URL access history data taken from a research organization and data from the famous archive site of phishing site information, PhishTank.com. Our approach achieved 2 to 3% better accuracy compared to the existing DL-based approach.
Submitted 11 November, 2021;
originally announced November 2021.
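The abstract above mentions generating feature vectors mechanically from URL strings. As a rough illustration only, the sketch below counts how often each byte value occurs in a URL, giving a fixed 256-dimensional vector. The function name `bag_of_bytes` and the plain byte-count encoding are assumptions made for this example; the paper's actual encoding may differ.

```python
def bag_of_bytes(url: str) -> list[int]:
    """Return a 256-element histogram of the byte values in a URL.

    Hypothetical helper for illustration; not the paper's exact feature
    extraction. Position information is discarded, as in a bag-of-words.
    """
    counts = [0] * 256
    for byte in url.encode("utf-8"):
        counts[byte] += 1
    return counts


vector = bag_of_bytes("http://a.b/")
# Every URL maps to a vector of the same length, regardless of URL length.
print(len(vector), sum(vector))
```

A fixed-length vector like this can be fed to any classifier without hand-crafted URL features, which is the general appeal of a mechanical encoding.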
-
Catching Unusual Traffic Behavior using TF-IDF-based Port Access Statistics Analysis
Authors:
Keiichi Shima
Abstract:
Detecting the anomalous behavior of traffic is one of the important actions for network operators. In this study, we applied term frequency-inverse document frequency (TF-IDF), which is a popular method used in natural language processing, to detect unusual behavior from network access logs. We mapped the term and document concepts to the port number and daily access history, respectively, and calculated the TF-IDF. With this approach, we could obtain ports frequently observed in fewer days compared to other port access activities. Such access behaviors are not always malicious activities; however, such information is a good indicator for starting a deeper analysis of traffic behavior. Using a real-life dataset, we could detect two bot-oriented accesses and one unique UDP traffic pattern.
Submitted 11 November, 2021;
originally announced November 2021.
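The mapping described in the abstract above (port number as the term, one day of access history as the document) can be sketched as follows. This is a minimal example under assumptions: the function name `port_tfidf`, the input layout, and the textbook weighting tf = count / total and idf = log(N / df) are choices made here, not necessarily the variant used in the paper.

```python
import math
from collections import Counter


def port_tfidf(daily_ports: dict[str, list[int]]) -> dict[str, dict[int, float]]:
    """Score each port per day with a textbook TF-IDF.

    daily_ports maps a day label to the list of destination ports seen
    that day (hypothetical input layout for illustration).
    """
    n_days = len(daily_ports)
    # Document frequency: the number of days on which each port appears.
    days_seen = Counter()
    for ports in daily_ports.values():
        days_seen.update(set(ports))

    scores = {}
    for day, ports in daily_ports.items():
        counts = Counter(ports)
        total = len(ports)
        scores[day] = {
            port: (count / total) * math.log(n_days / days_seen[port])
            for port, count in counts.items()
        }
    return scores


logs = {
    "day1": [80, 80, 443],
    "day2": [80, 443],
    "day3": [80, 443, 23, 23, 23, 23],
}
scores = port_tfidf(logs)
# Ports seen every day score zero; a port that is busy on few days stands out.
print(max(scores["day3"], key=scores["day3"].get))
```

With this weighting, a port that appears on every day gets an idf of zero, so only ports concentrated on a small number of days receive high scores, matching the behavior the abstract describes.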
-
Classifying DNS Servers based on Response Message Matrix using Machine Learning
Authors:
Keiichi Shima,
Ryo Nakamura,
Kazuya Okada,
Tomohiro Ishihara,
Daisuke Miyamoto,
Yuji Sekiya
Abstract:
Improperly configured domain name system (DNS) servers are sometimes used as packet reflectors as part of a DoS or DDoS attack. Detecting packets created as a result of this activity is logically possible by monitoring the DNS request and response traffic. Any response that does not have a corresponding request can be considered a reflected message; checking and tracking every DNS packet, however, is a non-trivial operation. In this paper, we propose a detection mechanism for DNS servers used as reflectors by using a DNS server feature matrix built from a small number of packets and a machine learning algorithm. The F1 score of bad DNS server detection was more than 0.9 when the test and training data are generated within the same day, and more than 0.7 for the data not used for the training and testing phase of the same day.
Submitted 9 November, 2021;
originally announced November 2021.
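The abstract above notes that a response with no corresponding request can be considered a reflected message, and that tracking every packet is costly. The sketch below illustrates only that baseline request/response matching, not the proposed machine learning detector. The packet field names (`client`, `server`, `txid`, `is_response`) and the function name are hypothetical.

```python
def find_unmatched_responses(packets: list[dict]) -> list[dict]:
    """Return DNS responses that have no earlier matching query.

    Each packet is a dict with hypothetical keys: client, server,
    txid (DNS transaction ID) and is_response.
    """
    pending = set()   # queries still waiting for a response
    unmatched = []
    for packet in packets:
        key = (packet["client"], packet["server"], packet["txid"])
        if packet["is_response"]:
            if key in pending:
                pending.discard(key)
            else:
                unmatched.append(packet)
        else:
            pending.add(key)
    return unmatched


trace = [
    {"client": "10.0.0.5", "server": "192.0.2.1", "txid": 1, "is_response": False},
    {"client": "10.0.0.5", "server": "192.0.2.1", "txid": 1, "is_response": True},
    {"client": "10.0.0.9", "server": "192.0.2.1", "txid": 7, "is_response": True},
]
print(len(find_unmatched_responses(trace)))
```

The `pending` set grows with every outstanding query, which hints at why per-packet tracking becomes expensive at scale.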
-
A Recurrent Probabilistic Neural Network with Dimensionality Reduction Based on Time-series Discriminant Component Analysis
Authors:
Hideaki Hayashi,
Taro Shibanoki,
Keisuke Shima,
Yuichi Kurita,
Toshio Tsuji
Abstract:
This paper proposes a probabilistic neural network developed on the basis of time-series discriminant component analysis (TSDCA) that can be used to classify high-dimensional time-series patterns. TSDCA involves the compression of high-dimensional time series into a lower-dimensional space using a set of orthogonal transformations and the calculation of posterior probabilities based on a continuous-density hidden Markov model with a Gaussian mixture model expressed in the reduced-dimensional space. The analysis can be incorporated into a neural network, which is named a time-series discriminant component network (TSDCN), so that parameters of dimensionality reduction and classification can be obtained simultaneously as network coefficients according to a backpropagation through time-based learning algorithm with the Lagrange multiplier method. The TSDCN is considered to enable high-accuracy classification of high-dimensional time-series patterns and to reduce the computation time taken for network training. The validity of the TSDCN is demonstrated for high-dimensional artificial data and EEG signals in the experiments conducted during the study.
Submitted 14 November, 2019;
originally announced November 2019.
-
Length Matters: Clustering System Log Messages using Length of Words
Authors:
Keiichi Shima
Abstract:
The analysis of system log messages (syslog messages) has a long history, dating from when the syslog mechanism was invented. Typically, the analysis consists of two parts: one is message template generation, and the other is finding something interesting using the messages classified by the inferred templates. It is important to generate better templates to achieve better, more precise, or more convincing analysis results. In this paper, we propose a classification methodology using the length of words of each message. Our method is suitable for online template generation because it does not require two-pass analysis to generate template messages, which is an important factor considering the increasing amount of log messages produced by a large number of system components such as cloud infrastructure.
Submitted 10 November, 2016;
originally announced November 2016.
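To illustrate the idea in the abstract above of classifying messages by the length of their words, the sketch below groups messages whose word-length sequences are identical, in a single pass. This exact-match grouping is a simplification assumed for the example; the paper's method may tolerate variable-length fields, and the function names are hypothetical.

```python
from collections import defaultdict


def length_signature(message: str) -> tuple[int, ...]:
    """Return the sequence of word lengths of a whitespace-split message."""
    return tuple(len(word) for word in message.split())


def cluster_by_word_length(messages: list[str]) -> dict[tuple[int, ...], list[str]]:
    """Group messages by word-length signature in one pass over the input."""
    clusters = defaultdict(list)
    for message in messages:
        clusters[length_signature(message)].append(message)
    return dict(clusters)


logs = [
    "link eth0 up",
    "link eth1 up",
    "session opened for user root",
]
clusters = cluster_by_word_length(logs)
print(len(clusters))
```

Because each message is assigned to a cluster as soon as it is read, no second pass over the log is needed, which is the property the abstract highlights for online template generation.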