Search | arXiv e-print repository

Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

Authors: Zackary Rackauckas, Arthur Câmara, Jakub Zavrel

Abstract: Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these… ▽ More Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of Retrieval-Augmented Generation (RAG) agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: Accepted to LLM4Eval @ SIGIR24

arXiv:2307.04601 [pdf, ps, other]

InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval

Authors: Hugo Abonizio, Luiz Bonifacio, Vitor Jeronymo, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

Abstract: Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LL… ▽ More Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LLMs as synthetic data generators for IR tasks. This makes them an attractive solution for IR tasks that suffer from a lack of annotated data. However, the reproducibility of these methods was limited, because InPars' training scripts are based on TPUs -- which are not widely accessible -- and because the code for Promptagator was not released and its proprietary LLM is not publicly accessible. To fully realize the potential of these methods and make their impact more widespread in the research community, the resources need to be accessible and easy to reproduce by researchers and practitioners. Our main contribution is a unified toolkit for end-to-end reproducible synthetic data generation research, which includes generation, filtering, training and evaluation. Additionally, we provide an interface to IR libraries widely used by the community and support for GPU. Our toolkit not only reproduces the InPars method and partially reproduces Promptagator, but also provides a plug-and-play functionality allowing the use of different LLMs, exploring filtering methods and finetuning various reranker models on the generated data. We also made available all the synthetic data generated in this work for the 18 different datasets in the BEIR benchmark which took more than 2,000 GPU hours to be generated as well as the reranker models finetuned on the synthetic data. Code and data are available at https://github.com/zetaalphavector/InPars △ Less

Submitted 10 July, 2023; originally announced July 2023.

arXiv:2301.01820 [pdf, ps, other]

InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

Authors: Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

Abstract: Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such dataset… ▽ More Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu △ Less

Submitted 26 May, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

arXiv:2109.12854 [pdf, other]

Quality Control Methodology for Simulation Models of Computer Network Protocols

Authors: Vladimír Veselý, Jan Zavřel

Abstract: This paper summarizes know-how about modeling and simulation of computer networking protocols we contributed to the OMNeT++ community. We propose a methodology aiming to set a reliable ground truth for the quality of simulation models of networking protocols. We demonstrate the application of this methodology on our EIGRP source code pull-requested to the INET framework. This paper summarizes know-how about modeling and simulation of computer networking protocols we contributed to the OMNeT++ community. We propose a methodology aiming to set a reliable ground truth for the quality of simulation models of networking protocols. We demonstrate the application of this methodology on our EIGRP source code pull-requested to the INET framework. △ Less

Submitted 27 September, 2021; originally announced September 2021.

Comments: Published in: M. Marek, G. Nardini, V. Vesely (Eds.), Proceedings of the 8th OMNeT++ Community Summit, Virtual Summit, September 8-10, 2021

Report number: OMNET/2021/09

arXiv:2011.00061 [pdf, other]

A New Neural Search and Insights Platform for Navigating and Organizing AI Research

Authors: Marzieh Fadaee, Olga Gureenkova, Fernando Rejon Barrera, Carsten Schnober, Wouter Weerkamp, Jakub Zavrel

Abstract: To provide AI researchers with modern tools for dealing with the explosive growth of the research literature in their field, we introduce a new platform, AI Research Navigator, that combines classical keyword search with neural retrieval to discover and organize relevant literature. The system provides search at multiple levels of textual granularity, from sentences to aggregations across document… ▽ More To provide AI researchers with modern tools for dealing with the explosive growth of the research literature in their field, we introduce a new platform, AI Research Navigator, that combines classical keyword search with neural retrieval to discover and organize relevant literature. The system provides search at multiple levels of textual granularity, from sentences to aggregations across documents, both in natural language and through navigation in a domain-specific Knowledge Graph. We give an overview of the overall architecture of the system and of the components for document analysis, question answering, search, analytics, expert search, and recommendations. △ Less

Submitted 30 October, 2020; originally announced November 2020.

Comments: Accepted to Workshop on Scholarly Document Processing (SDP) at EMNLP 2020

arXiv:2010.08269 [pdf, other]

doi 10.18653/v1/2020.sdp-1.7

Effective Distributed Representations for Academic Expert Search

Authors: Mark Berger, Jakub Zavrel, Paul Groth

Abstract: Expert search aims to find and rank experts based on a user's query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of academic papers (i.e. embeddings) impact academic expert retrieval. We use the Microsoft Academic Graph dataset and experiment with different configurations of a… ▽ More Expert search aims to find and rank experts based on a user's query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of academic papers (i.e. embeddings) impact academic expert retrieval. We use the Microsoft Academic Graph dataset and experiment with different configurations of a document-centric voting model for retrieval. In particular, we explore the impact of the use of contextualized embeddings on search performance. We also present results for paper embeddings that incorporate citation information through retrofitting. Additionally, experiments are conducted using different techniques for assigning author weights based on author order. We observe that using contextual embeddings produced by a transformer model trained for sentence similarity tasks produces the most effective paper representations for document-centric expert retrieval. However, retrofitting the paper embeddings and using elaborate author contribution weighting strategies did not improve retrieval performance. △ Less

Submitted 16 October, 2020; originally announced October 2020.

Comments: To be published in the Scholarly Document Processing 2020 Workshop @ EMNLP 2020 proceedings

arXiv:cs/0007018 [pdf, ps, other]

Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers

Authors: Jakub Zavrel, Walter Daelemans

Abstract: This paper describes a new method, Combi-bootstrap, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. Combi-bootstrap uses existing resources as features for a second level machine learning module, that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that Combi-bootstrap: i) can… ▽ More This paper describes a new method, Combi-bootstrap, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. Combi-bootstrap uses existing resources as features for a second level machine learning module, that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that Combi-bootstrap: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7 % error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample. △ Less

Submitted 13 July, 2000; originally announced July 2000.

Comments: 4 pages

ACM Class: I.2.7; I.2.6

Journal ref: Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), pp. 17--20

arXiv:cs/9812021 [pdf, ps, other]

Forgetting Exceptions is Harmful in Language Learning

Authors: Walter Daelemans, Antal van den Bosch, Jakub Zavrel

Abstract: We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first s… ▽ More We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms. △ Less

Submitted 22 December, 1998; originally announced December 1998.

Comments: 31 pages, 7 figures, 10 tables. uses 11pt, fullname, a4wide tex styles. Pre-print version of article to appear in Machine Learning 11:1-3, Special Issue on Natural Language Learning. Figures on page 22 slightly compressed to avoid page overload

ACM Class: I.2.6; I.2.7

arXiv:cmp-lg/9807013 [pdf, ps, other]

Improving Data Driven Wordclass Tagging by System Combination

Authors: Hans van Halteren, Jakub Zavrel, Walter Daelemans

Abstract: In this paper we examine how the differences in modelling between different data driven systems performing the same NLP task can be exploited to yield a higher accuracy than the best individual system. We do this by means of an experiment involving the task of morpho-syntactic wordclass tagging. Four well-known tagger generators (Hidden Markov Model, Memory-Based, Transformation Rules and Maximu… ▽ More In this paper we examine how the differences in modelling between different data driven systems performing the same NLP task can be exploited to yield a higher accuracy than the best individual system. We do this by means of an experiment involving the task of morpho-syntactic wordclass tagging. Four well-known tagger generators (Hidden Markov Model, Memory-Based, Transformation Rules and Maximum Entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second stage classifiers. All combination taggers outperform their best component, with the best combination showing a 19.1% lower error rate than the best individual tagger. △ Less

Submitted 31 July, 1998; originally announced July 1998.

Comments: 7 pages, LaTeX, uses acl.bst, colacl.sty

Journal ref: Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL'98)

arXiv:cmp-lg/9705010 [pdf, ps, other]

Memory-Based Learning: Using Similarity for Smoothing

Authors: Jakub Zavrel, Walter Daelemans

Abstract: This paper analyses the relation between the use of similarity in Memory-Based Learning and the notion of backed-off smoothing in statistical language modeling. We show that the two approaches are closely related, and we argue that feature weighting methods in the Memory-Based paradigm can offer the advantage of automatically specifying a suitable domain-specific hierarchy between most specific… ▽ More This paper analyses the relation between the use of similarity in Memory-Based Learning and the notion of backed-off smoothing in statistical language modeling. We show that the two approaches are closely related, and we argue that feature weighting methods in the Memory-Based paradigm can offer the advantage of automatically specifying a suitable domain-specific hierarchy between most specific and most general conditioning information without the need for a large number of parameters. We report two applications of this approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art performance in both domains, and allows the easy integration of diverse information sources, such as rich lexical representations. △ Less

Submitted 12 May, 1997; originally announced May 1997.

Comments: 8 pages, uses aclap.sty, To appear in Proc. ACL/EACL 97

Report number: ILK-9702

arXiv:cmp-lg/9607012 [pdf, ps, other]

MBT: A Memory-Based Part of Speech Tagger-Generator

Authors: Walter Daelemans, Jakub Zavrel, Peter Berck, Steven Gillis

Abstract: We introduce a memory-based approach to part of speech tagging. Memory-based learning is a form of supervised learning based on similarity-based reasoning. The part of speech tag of a word in a particular context is extrapolated from the most similar cases held in memory. Supervised learning approaches are useful when a tagged corpus is available as an example of the desired output of the tagger… ▽ More We introduce a memory-based approach to part of speech tagging. Memory-based learning is a form of supervised learning based on similarity-based reasoning. The part of speech tag of a word in a particular context is extrapolated from the most similar cases held in memory. Supervised learning approaches are useful when a tagged corpus is available as an example of the desired output of the tagger. Based on such a corpus, the tagger-generator automatically builds a tagger which is able to tag new text the same way, diminishing development time for the construction of a tagger considerably. Memory-based tagging shares this advantage with other statistical or machine learning approaches. Additional advantages specific to a memory-based approach include (i) the relatively small tagged corpus size sufficient for training, (ii) incremental learning, (iii) explanation capabilities, (iv) flexible integration of information in case representations, (v) its non-parametric nature, (vi) reasonably good results on unknown words without morphological analysis, and (vii) fast learning and tagging. In this paper we show that a large-scale application of the memory-based approach is feasible: we obtain a tagging accuracy that is on a par with that of known statistical approaches, and with attractive space and time complexity properties when using {\em IGTree}, a tree-based formalism for indexing and searching huge case bases.} The use of IGTree has as additional advantage that optimal context size for disambiguation is dynamically computed. △ Less

Submitted 11 July, 1996; originally announced July 1996.

Comments: 14 pages, 2 Postscript figures

Journal ref: Proceedings WVLC, Copenhagen

Showing 1–11 of 11 results for author: Zavrel, J