-
Enhancing Q&A with Domain-Specific Fine-Tuning and Iterative Reasoning: A Comparative Study
Authors:
Zooey Nguyen,
Anthony Annunziata,
Vinh Luong,
Sang Dinh,
Quynh Le,
Anh Hai Ha,
Chanh Le,
Hong An Phan,
Shruti Raghavan,
Christopher Nguyen
Abstract:
This paper investigates the impact of domain-specific model fine-tuning and of reasoning mechanisms on the performance of question-answering (Q&A) systems powered by large language models (LLMs) and Retrieval-Augmented Generation (RAG). Using the FinanceBench SEC financial filings dataset, we observe that, for RAG, combining a fine-tuned embedding model with a fine-tuned LLM achieves better accuracy than generic models, with relatively greater gains attributable to fine-tuned embedding models. Additionally, employing reasoning iterations on top of RAG delivers an even bigger jump in performance, enabling the Q&A systems to get closer to human-expert quality. We discuss the implications of such findings, propose a structured technical design space capturing major technical components of Q&A AI, and provide recommendations for making high-impact technical choices for such components. We plan to follow up on this work with actionable guides for AI teams and further investigations into the impact of domain-specific augmentation in RAG and into agentic AI capabilities such as advanced planning and reasoning.
Submitted 19 April, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Streaming Active Deep Forest for Evolving Data Stream Classification
Authors:
Anh Vu Luong,
Tien Thanh Nguyen,
Alan Wee-Chung Liew
Abstract:
In recent years, Deep Neural Networks (DNNs) have steadily gained momentum in many areas of machine learning. The layer-by-layer process of DNNs has inspired the development of many deep models, including deep ensembles. The most notable deep ensemble-based model is Deep Forest, which can achieve highly competitive performance while having far fewer hyper-parameters than DNNs. Despite its success in the batch learning setting, no effort has been made to adapt Deep Forest to the context of evolving data streams. In this work, we introduce the Streaming Deep Forest (SDF) algorithm, a high-performance deep ensemble method specially adapted to stream classification. We also present the Augmented Variable Uncertainty (AVU) active learning strategy to reduce the labeling cost in the streaming context. We compare the proposed methods to state-of-the-art streaming algorithms on a wide range of datasets. The results show that, by following the AVU active learning strategy, SDF with only 70% of the labeling budget significantly outperforms other methods trained with all instances.
Submitted 26 February, 2020;
originally announced February 2020.
-
An improvement on fragmentation in Distribution Database Design Based on Knowledge-Oriented Clustering Techniques
Authors:
Van Nghia Luong,
Ha Huy Cuong Nguyen,
Van Son Le
Abstract:
The problem of optimizing a distributed database includes fragmentation and data positioning. Several different approaches and algorithms have been proposed to solve this problem. In this paper, we propose an algorithm that builds the initial equivalence relation based on a distance threshold. This threshold is also based on knowledge-oriented clustering techniques for both horizontal and vertical fragmentation. The similarity measures used in the algorithms are developed from the classical measures. Experimental results on a small data set match the fragmentation results of the classical algorithm. Execution time and data fragmentation are significantly reduced, while the complexity of our algorithm in the general case is stable.
Submitted 6 May, 2015;
originally announced May 2015.