Learning to Generate Answers with Citations via Factual Consistency Models

Authors: Rami Aly, Zhiqiang Tang, Samson Tan, George Karypis

Abstract: Large Language Models (LLMs) frequently hallucinate, impeding their reliability in mission-critical situations. One approach to address this issue is to provide citations to relevant sources alongside generated content, enhancing the verifiability of generations. However, citing passages accurately in answers remains a substantial challenge. This paper proposes a weakly-supervised fine-tuning method leveraging factual consistency models (FCMs). Our approach alternates between generating texts with citations and supervised fine-tuning with FCM-filtered citation data. Focused learning is integrated into the objective, directing the fine-tuning process to emphasise the factual unit tokens, as measured by an FCM. Results on the ALCE few-shot citation benchmark with various instruction-tuned LLMs demonstrate superior performance compared to in-context learning, vanilla supervised fine-tuning, and state-of-the-art methods, with an average improvement of $34.1$, $15.5$, and $10.5$ citation F$_1$ points, respectively. Moreover, in a domain transfer setting we show that the obtained citation generation ability robustly transfers to unseen datasets. Notably, our citation improvements contribute to the lowest factual error rate across baselines.

Submitted 15 July, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024. Code is available at https://github.com/amazon-science/learning-to-generate-answers-with-citations

arXiv:2404.03818 [pdf, other]

PRobELM: Plausibility Ranking Evaluation for Language Models

Authors: Zhangdie Yuan, Eric Chamoun, Rami Aly, Chenxi Whitehouse, Andreas Vlachos

Abstract: This paper introduces PRobELM (Plausibility Ranking Evaluation for Language Models), a benchmark designed to assess language models' ability to discern more plausible from less plausible scenarios through their parametric knowledge. While benchmarks such as TruthfulQA emphasise factual accuracy or truthfulness, and others such as COPA explore plausible scenarios without explicitly incorporating world knowledge, PRobELM seeks to bridge this gap by evaluating models' capabilities to prioritise plausible scenarios that leverage world knowledge over less plausible alternatives. This design allows us to assess the potential of language models for downstream use cases such as literature-based discovery where the focus is on identifying information that is likely but not yet known. Our benchmark is constructed from a dataset curated from Wikidata edit histories, tailored to align the temporal bounds of the training data for the evaluated models. PRobELM facilitates the evaluation of language models across multiple prompting types, including statement, text completion, and question-answering. Experiments with 10 models of various sizes and architectures on the relationship between model scales, training recency, and plausibility performance, reveal that factual accuracy does not directly correlate with plausibility performance and that up-to-date training data enhances plausibility assessment across different model architectures.

Submitted 7 August, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

arXiv:2310.14198 [pdf, other]

QA-NatVer: Question Answering for Natural Logic-based Fact Verification

Authors: Rami Aly, Marek Strong, Andreas Vlachos

Abstract: Fact verification systems assess a claim's veracity based on evidence. An important consideration in designing them is faithfulness, i.e. generating explanations that accurately reflect the reasoning of the model. Recent works have focused on natural logic, which operates directly on natural language by capturing the semantic relation of spans between an aligned claim with its evidence via set-theoretic operators. However, these approaches rely on substantial resources for training, which are only available for high-resource languages. To this end, we propose to use question answering to predict natural logic operators, taking advantage of the generalization capabilities of instruction-tuned language models. Thus, we obviate the need for annotated training data while still relying on a deterministic inference system. In a few-shot setting on FEVER, our approach outperforms the best baseline by $4.3$ accuracy points, including a state-of-the-art pre-trained seq2seq natural logic system, as well as a state-of-the-art prompt-based classifier. Our system demonstrates its robustness and portability, achieving competitive performance on a counterfactual dataset and surpassing all approaches without further annotation on a Danish verification dataset. A human evaluation indicates that our approach produces more plausible proofs with fewer erroneous natural logic operators than previous natural logic-based systems.

Submitted 22 October, 2023; originally announced October 2023.

Comments: EMNLP 2023

arXiv:2305.12576 [pdf, other]

Automated Few-shot Classification with Instruction-Finetuned Language Models

Authors: Rami Aly, Xingjian Shi, Kaixiang Lin, Aston Zhang, Andrew Gordon Wilson

Abstract: A particularly successful class of approaches for few-shot learning combines language models with prompts -- hand-crafted task descriptions that complement data samples. However, designing prompts by hand for each task commonly requires domain knowledge and substantial guesswork. We observe, in the context of classification tasks, that instruction finetuned language models exhibit remarkable prompt robustness, and we subsequently propose a simple method to eliminate the need for handcrafted prompts, named AuT-Few. This approach consists of (i) a prompt retrieval module that selects suitable task instructions from the instruction-tuning knowledge base, and (ii) the generation of two distinct, semantically meaningful, class descriptions and a selection mechanism via cross-validation. Over $12$ datasets, spanning $8$ classification tasks, we show that AuT-Few outperforms current state-of-the-art few-shot learning methods. Moreover, AuT-Few is the best ranking method across datasets on the RAFT few-shot benchmark. Notably, these results are achieved without task-specific handcrafted prompts on unseen tasks.

Submitted 21 October, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: EMNLP 2023 Findings

arXiv:2212.05276 [pdf, other]

Natural Logic-guided Autoregressive Multi-hop Document Retrieval for Fact Verification

Authors: Rami Aly, Andreas Vlachos

Abstract: A key component of fact verification is evidence retrieval, often from multiple documents. Recent approaches use dense representations and condition the retrieval of each document on the previously retrieved ones. The latter step is performed over all the documents in the collection, requiring storing their dense representations in an index, thus incurring a high memory footprint. An alternative paradigm is retrieve-and-rerank, where documents are retrieved using methods such as BM25, their sentences are reranked, and further documents are retrieved conditioned on these sentences, reducing the memory requirements. However, such approaches can be brittle as they rely on heuristics and assume hyperlinks between documents. We propose a novel retrieve-and-rerank method for multi-hop retrieval that consists of a retriever that jointly scores documents in the knowledge source and sentences from previously retrieved documents using an autoregressive formulation and is guided by a proof system based on natural logic that dynamically terminates the retrieval process if the evidence is deemed sufficient. This method is competitive with current state-of-the-art methods on FEVER, HoVer and FEVEROUS-S, while using $5$ to $10$ times less memory than competing systems. Evaluation on an adversarial dataset indicates improved stability of our approach compared to commonly deployed threshold-based methods. Finally, the proof system helps humans predict model decisions correctly more often than using the evidence alone.

Submitted 10 December, 2022; originally announced December 2022.

Comments: EMNLP 2022

arXiv:2206.04449 [pdf, other]

Segmentation Enhanced Lameness Detection in Dairy Cows from RGB and Depth Video

Authors: Eric Arazo, Robin Aly, Kevin McGuinness

Abstract: Cow lameness is a severe condition that affects the life cycle and life quality of dairy cows and results in considerable economic losses. Early lameness detection helps farmers address illnesses early and avoid negative effects caused by the degeneration of cows' condition. We collected a dataset of short clips of cows passing through a hallway exiting a milking station and annotated the degree of lameness of the cows. This paper explores the resulting dataset and provides a detailed description of the data collection process. Additionally, we proposed a lameness detection method that leverages pre-trained neural networks to extract discriminative features from videos and assign a binary score to each cow indicating its condition: "healthy" or "lame." We improve this approach by forcing the model to focus on the structure of the cow, which we achieve by substituting the RGB videos with binary segmentation masks predicted with a trained segmentation model. This work aims to encourage research and provide insights into the applicability of computer vision models for cow lameness detection on farms.

Submitted 9 June, 2022; originally announced June 2022.

Comments: Accepted at the CV4Animals workshop in CVPR 2022

arXiv:2106.05707 [pdf, other]

FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information

Authors: Rami Aly, Zhijiang Guo, Michael Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, Arpit Mittal

Abstract: Fact verification has attracted a lot of attention in the machine learning and natural language processing communities, as it is one of the key methods for detecting misinformation. Existing large-scale benchmarks for this task have focused mostly on textual sources, i.e. unstructured information, and thus ignored the wealth of information available in structured formats, such as tables. In this paper we introduce a novel dataset and benchmark, Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS), which consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict. Furthermore, we detail our efforts to track and minimize the biases that are present in the dataset and could be exploited by models, e.g. being able to predict the label without using evidence. Finally, we develop a baseline for verifying claims against text and tables which predicts both the correct evidence and verdict for 18% of the claims.

Submitted 12 October, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: Accepted at NeurIPS 2021 Datasets and Benchmarks Track

arXiv:1906.02002 [pdf, other]

Every child should have parents: a taxonomy refinement algorithm based on hyperbolic term embeddings

Authors: Rami Aly, Shantanu Acharya, Alexander Ossa, Arne Köhn, Chris Biemann, Alexander Panchenko

Abstract: We introduce the use of Poincaré embeddings to improve existing state-of-the-art approaches to domain-specific taxonomy induction from text as a signal for both relocating wrong hyponym terms within a (pre-induced) taxonomy as well as for attaching disconnected terms in a taxonomy. This method substantially improves previous state-of-the-art results on the SemEval-2016 Task 13 on taxonomy extraction. We demonstrate the superiority of Poincaré embeddings over distributional semantic representations, supporting the hypothesis that they can better capture hierarchical lexical-semantic relationships than embeddings in the Euclidean space.

Submitted 5 June, 2019; originally announced June 2019.

Comments: 7 pages (5 + 2 pages references), 2 Figures, 3 Tables, Accepted to the ACL 2019 conference. Will appear in its proceedings

arXiv:1611.03660 [pdf, other]

Using text mining and machine learning for detection of child abuse

Authors: Chintan Amrit, Tim Paauw, Robin Aly, Miha Lavric

Abstract: Abuse in any form is a grave threat to a child's health. Public health institutions in the Netherlands try to identify and prevent different kinds of abuse, and building a decision support system can help such institutions achieve this goal. Such decision support relies on the analysis of relevant child health data. A significant part of the medical data that the institutions have on children is unstructured, and in the form of free text notes. In this research, we employ machine learning and text mining techniques to detect patterns of possible child abuse in the data. The resulting model achieves a high score in classifying cases of possible abuse. We then describe our implementation of the decision support API at a municipality in the Netherlands.

Submitted 16 November, 2016; v1 submitted 11 November, 2016; originally announced November 2016.

Comments: 31 pages, 7 figures and 12 tables

ACM Class: H.4.2; I.2.7

arXiv:1511.07237 [pdf, other]

doi 10.1007/s10791-015-9275-x

Predicting Relevance based on Assessor Disagreement: Analysis and Practical Applications for Search Evaluation

Authors: Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, Chris Develder

Abstract: Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor that provided the judgments. A factor that cannot be ignored when extending conclusions made from assessors towards users, is the possible disagreement on relevance, assuming that a single gold truth label does not exist. This paper presents and analyzes the Predicted Relevance Model (PRM), which allows predicting a particular result's relevance for a random user, based on an observed assessment and knowledge on the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance, can be transformed into more robust and effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain (nDCG), which nowadays often use heuristic and data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems on different scenarios, such as their capability of retrieving top results, or how well they are able to filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.

Submitted 23 November, 2015; originally announced November 2015.

Comments: Accepted for publication in Springer Information Retrieval Journal, special issue on Information Retrieval Evaluation using Test Collections

arXiv:1312.1913 [pdf, other]

Adapting Binary Information Retrieval Evaluation Metrics for Segment-based Retrieval Tasks

Authors: Robin Aly, Maria Eskevich, Roeland Ordelman, Gareth J. F. Jones

Abstract: This report describes metrics for the evaluation of the effectiveness of segment-based retrieval based on existing binary information retrieval metrics. These metrics are described in the context of a task for the hyperlinking of video segments. This evaluation approach re-uses existing evaluation measures from the standard Cranfield evaluation paradigm. Our adaptation approach can in principle be used with any kind of effectiveness measure that uses binary relevance, and for other segment-based retrieval tasks. In our video hyperlinking setting, we use precision at a cut-off rank n and mean average precision.

Submitted 6 December, 2013; originally announced December 2013.

Comments: Explanation of evaluation measures for the linking task of the MediaEval Workshop 2013

Showing 1–11 of 11 results for author: Aly, R