-
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
Authors:
Dongyu Ru,
Lin Qiu,
Xiangkun Hu,
Tianhang Zhang,
Peng Shi,
Shuaichen Chang,
Cheng Jiayang,
Cunxiang Wang,
Shichao Sun,
Huanyu Li,
Zizhao Zhang,
Binjie Wang,
Jiarong Jiang,
Tong He,
Zhiguo Wang,
Pengfei Liu,
Yue Zhang,
Zheng Zhang
Abstract:
Despite Retrieval-Augmented Generation (RAG) showing promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, evaluation of long-form responses and reliability of measurements. In this paper, we propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules. Meta evaluation verifies that RAGChecker has significantly better correlations with human judgments than other evaluation metrics. Using RAGChecker, we evaluate 8 RAG systems and conduct an in-depth analysis of their performance, revealing insightful patterns and trade-offs in the design choices of RAG architectures. The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems. This work has been open sourced at https://github.com/amazon-science/RAGChecker.
Submitted 16 August, 2024; v1 submitted 15 August, 2024;
originally announced August 2024.
-
OpenResearcher: Unleashing AI for Accelerated Scientific Research
Authors:
Yuxiang Zheng,
Shichao Sun,
Lin Qiu,
Dongyu Ru,
Cheng Jiayang,
Xuefeng Li,
Jifan Lin,
Binjie Wang,
Yun Luo,
Renjie Pan,
Yang Xu,
Qingkai Min,
Zizhao Zhang,
Yiwen Wang,
Wenjie Li,
Pengfei Liu
Abstract:
The rapid growth of scientific literature imposes significant challenges for researchers endeavoring to stay updated with the latest advancements in their fields and delve into new areas. We introduce OpenResearcher, an innovative platform that leverages Artificial Intelligence (AI) techniques to accelerate the research process by answering diverse questions from researchers. OpenResearcher is built based on Retrieval-Augmented Generation (RAG) to integrate Large Language Models (LLMs) with up-to-date, domain-specific knowledge. Moreover, we develop various tools for OpenResearcher to understand researchers' queries, search from the scientific literature, filter retrieved information, provide accurate and comprehensive answers, and self-refine these answers. OpenResearcher can flexibly use these tools to balance efficiency and effectiveness. As a result, OpenResearcher enables researchers to save time and increase their potential to discover new insights and drive scientific breakthroughs. Demo, video, and code are available at: https://github.com/GAIR-NLP/OpenResearcher.
Submitted 13 August, 2024;
originally announced August 2024.
-
Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?
Authors:
Siyu Yuan,
Cheng Jiayang,
Lin Qiu,
Deqing Yang
Abstract:
Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available at https://github.com/siyuyuan/SCUA.
Submitted 17 June, 2024;
originally announced June 2024.
-
NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding
Authors:
Chunkit Chan,
Cheng Jiayang,
Yauwai Yim,
Zheye Deng,
Wei Fan,
Haoran Li,
Xin Liu,
Hongming Zhang,
Weiqi Wang,
Yangqiu Song
Abstract:
Large Language Models (LLMs) have sparked substantial interest and debate concerning their potential emergence of Theory of Mind (ToM) ability. Current ToM evaluations focus on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations, and thus lack evaluation of machine ToM ability in real-world human interaction scenarios. This creates a pressing demand for new real-world scenario benchmarks. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation settings covering multi-dimensional mental states (i.e., desires, beliefs, and intentions). Our benchmark builds upon the Belief-Desire-Intention (BDI) agent modeling theory and conducts the necessary empirical experiments to evaluate large language models. Our findings demonstrate that NegotiationToM is challenging for state-of-the-art LLMs, as they consistently perform significantly worse than humans, even when employing the chain-of-thought (CoT) method.
Submitted 4 July, 2024; v1 submitted 21 April, 2024;
originally announced April 2024.
-
EventGround: Narrative Reasoning by Grounding to Eventuality-centric Knowledge Graphs
Authors:
Cheng Jiayang,
Lin Qiu,
Chunkit Chan,
Xin Liu,
Yangqiu Song,
Zheng Zhang
Abstract:
Narrative reasoning relies on the understanding of eventualities in story contexts, which requires a wealth of background world knowledge. To help machines leverage such knowledge, existing solutions can be categorized into two groups. Some focus on implicitly modeling eventuality knowledge by pretraining language models (LMs) with eventuality-aware objectives. However, this approach breaks down knowledge structures and lacks interpretability. Others explicitly collect world knowledge of eventualities into structured eventuality-centric knowledge graphs (KGs). However, existing research on leveraging these knowledge sources for free-texts is limited. In this work, we propose an initial comprehensive framework called EventGround, which aims to tackle the problem of grounding free-texts to eventuality-centric KGs for contextualized narrative reasoning. We identify two critical problems in this direction: the event representation and sparsity problems. We provide simple yet effective parsing and partial information extraction methods to tackle these problems. Experimental results demonstrate that our approach consistently outperforms baseline models when combined with graph neural network (GNN) or large language model (LLM) based graph reasoning models. Our framework, incorporating grounded knowledge, achieves state-of-the-art performance while providing interpretable evidence.
Submitted 7 July, 2024; v1 submitted 29 March, 2024;
originally announced April 2024.
-
StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding
Authors:
Cheng Jiayang,
Lin Qiu,
Tsz Ho Chan,
Tianqing Fang,
Weiqi Wang,
Chunkit Chan,
Dongyu Ru,
Qipeng Guo,
Hongming Zhang,
Yangqiu Song,
Yue Zhang,
Zheng Zhang
Abstract:
Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on StoryAnalogy, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are incredibly difficult not only for sentence embedding models but also for the recent large language models (LLMs) such as ChatGPT and LLaMa. ChatGPT, for example, only achieved around 30% accuracy in multiple-choice questions (compared to over 85% accuracy for humans). Furthermore, we observe that the data in StoryAnalogy can improve the quality of analogy generation in LLMs, where a fine-tuned FlanT5-xxl model achieves comparable performance to zero-shot ChatGPT.
Submitted 23 October, 2023; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
Authors:
Cunxiang Wang,
Xiaoze Liu,
Yuanhao Yue,
Xiangru Tang,
Tianhang Zhang,
Cheng Jiayang,
Yunzhi Yao,
Wenyang Gao,
Xuming Hu,
Zehan Qi,
Yidong Wang,
Linyi Yang,
Jindong Wang,
Xing Xie,
Zheng Zhang,
Yue Zhang
Abstract:
This survey addresses the crucial issue of factuality in Large Language Models (LLMs). As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital. We define the Factuality Issue as the probability of LLMs producing content inconsistent with established facts. We first delve into the implications of these inaccuracies, highlighting the potential consequences and challenges posed by factual errors in LLM outputs. Subsequently, we analyze the mechanisms through which LLMs store and process facts, seeking the primary causes of factual errors. Our discussion then transitions to methodologies for evaluating LLM factuality, emphasizing key metrics, benchmarks, and studies. We further explore strategies for enhancing LLM factuality, including approaches tailored for specific domains. We focus on two primary LLM configurations, standalone LLMs and Retrieval-Augmented LLMs that utilize external data, and detail their unique challenges and potential enhancements. Our survey offers a structured guide for researchers aiming to fortify the factual reliability of LLMs.
Submitted 16 December, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.