Search | arXiv e-print repository

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents

Authors: Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, Carolin Lawrence

Abstract: The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key corner stones to efficient and reliable progress. However, existing benchmarks are often narrow and simply compute overall task success. To face these issues, we propose AgentQuest -- a framew… ▽ More The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key corner stones to efficient and reliable progress. However, existing benchmarks are often narrow and simply compute overall task success. To face these issues, we propose AgentQuest -- a framework where (i) both benchmarks and metrics are modular and easily extensible through well documented and easy-to-use APIs; (ii) we offer two new evaluation metrics that can reliably track LLM agent progress while solving a task. We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase. Together with the research community, we hope to extend AgentQuest further and therefore we make it available under https://github.com/nec-research/agentquest. △ Less

Submitted 9 April, 2024; originally announced April 2024.

Comments: Accepted at the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2024)

arXiv:2311.14966 [pdf, other]

Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains

Authors: Chia-Chien Hung, Wiem Ben Rim, Lindsay Frost, Lars Bruckner, Carolin Lawrence

Abstract: High-risk domains pose unique challenges that require language models to provide accurate and safe responses. Despite the great success of large language models (LLMs), such as ChatGPT and its variants, their performance in high-risk domains remains unclear. Our study delves into an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. T… ▽ More High-risk domains pose unique challenges that require language models to provide accurate and safe responses. Despite the great success of large language models (LLMs), such as ChatGPT and its variants, their performance in high-risk domains remains unclear. Our study delves into an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. To comprehensively assess the capabilities of LLMs, we conduct experiments on six NLP datasets including question answering and summarization tasks within two high-risk domains: legal and medical. Further qualitative analysis highlights the existing limitations inherent in current LLMs when evaluating in high-risk domains. This underscores the essential nature of not only improving LLM capabilities but also prioritizing the refinement of domain-specific metrics, and embracing a more human-centric approach to enhance safety and factual reliability. Our findings advance the field toward the concerns of properly evaluating LLMs in high-risk domains, aiming to steer the adaptability of LLMs in fulfilling societal obligations and aligning with forthcoming regulations, such as the EU AI Act. △ Less

Submitted 25 November, 2023; originally announced November 2023.

Comments: EMNLP 2023 Workshop on Benchmarking Generalisation in NLP (GenBench)

arXiv:2310.14909 [pdf, other]

Linking Surface Facts to Large-Scale Knowledge Graphs

Authors: Gorjan Radevski, Kiril Gashteovski, Chia-Chien Hung, Carolin Lawrence, Goran Glavaš

Abstract: Open Information Extraction (OIE) methods extract facts from natural language text in the form of ("subject"; "relation"; "object") triples. These facts are, however, merely surface forms, the ambiguity of which impedes their downstream usage; e.g., the surface phrase "Michael Jordan" may refer to either the former basketball player or the university professor. Knowledge Graphs (KGs), on the other… ▽ More Open Information Extraction (OIE) methods extract facts from natural language text in the form of ("subject"; "relation"; "object") triples. These facts are, however, merely surface forms, the ambiguity of which impedes their downstream usage; e.g., the surface phrase "Michael Jordan" may refer to either the former basketball player or the university professor. Knowledge Graphs (KGs), on the other hand, contain facts in a canonical (i.e., unambiguous) form, but their coverage is limited by a static schema (i.e., a fixed set of entities and predicates). To bridge this gap, we need the best of both worlds: (i) high coverage of free-text OIEs, and (ii) semantic precision (i.e., monosemy) of KGs. In order to achieve this goal, we propose a new benchmark with novel evaluation protocols that can, for example, measure fact linking performance on a granular triple slot level, while also measuring if a system has the ability to recognize that a surface form has no match in the existing KG. Our extensive evaluation of several baselines show that detection of out-of-KG entities and predicates is more difficult than accurate linking to existing ones, thus calling for more research efforts on this difficult task. We publicly release all resources (data, benchmark and code) on https://github.com/nec-research/fact-linking. △ Less

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.07333 [pdf, ps, other]

Computing approximate roots of monotone functions

Authors: Alexandros Hollender, Chester Lawrence, Erel Segal-Halevi

Abstract: Given a function f: [a,b] -> R, if f(a) < 0 and f(b)> 0 and f is continuous, the Intermediate Value Theorem implies that f has a root in [a,b]. Moreover, given a value-oracle for f, an approximate root of f can be computed using the bisection method, and the number of required evaluations is polynomial in the number of accuracy digits. The goal of this note is to identify conditions under which th… ▽ More Given a function f: [a,b] -> R, if f(a) < 0 and f(b)> 0 and f is continuous, the Intermediate Value Theorem implies that f has a root in [a,b]. Moreover, given a value-oracle for f, an approximate root of f can be computed using the bisection method, and the number of required evaluations is polynomial in the number of accuracy digits. The goal of this note is to identify conditions under which this polynomiality result extends to a multi-dimensional function that satisfies the conditions of Miranda's theorem -- the natural multi-dimensional extension of the Intermediate Value Theorem. In general, finding an approximate root might require an exponential number of evaluations even for a two-dimensional function. We show that, if f is two-dimensional and satisfies a single monotonicity condition, then the number of required evaluations is polynomial in the accuracy. For any fixed dimension d, if f is a d-dimensional function that satisfies all d^2-d ``ex-diagonal'' monotonicity conditions (that is, component i of f is monotonically decreasing with respect to variable j for all i!=j), then the number of required evaluations is polynomial in the accuracy. But if f satisfies only d^2-d-2 ex-diagonal conditions, then the number of required evaluations may be exponential in the accuracy. The case of d^2-d-1 ex-diagonal conditions remains unsolved. As an example application, we show that computing approximate roots of monotone functions can be used for approximate envy-free cake-cutting. △ Less

Submitted 29 February, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: We solved all the open cases, except the case when f has 3 or more dimensions, and satisfies all monotonicity conditions except one. Any ideas?

arXiv:2307.00524 [pdf, other]

Large Language Models Enable Few-Shot Clustering

Authors: Vijay Viswanathan, Kiril Gashteovski, Carolin Lawrence, Tongshuang Wu, Graham Neubig

Abstract: Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model can amplify an expert'… ▽ More Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs post-correction). We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use. △ Less

Submitted 2 July, 2023; originally announced July 2023.

arXiv:2304.00918 [pdf, other]

doi 10.1109/ICDM54844.2022.00167

Uncertainty Propagation in Node Classification

Authors: Zhao Xu, Carolin Lawrence, Ammar Shaker, Raman Siarheyeu

Abstract: Quantifying predictive uncertainty of neural networks has recently attracted increasing attention. In this work, we focus on measuring uncertainty of graph neural networks (GNNs) for the task of node classification. Most existing GNNs model message passing among nodes. The messages are often deterministic. Questions naturally arise: Does there exist uncertainty in the messages? How could we propag… ▽ More Quantifying predictive uncertainty of neural networks has recently attracted increasing attention. In this work, we focus on measuring uncertainty of graph neural networks (GNNs) for the task of node classification. Most existing GNNs model message passing among nodes. The messages are often deterministic. Questions naturally arise: Does there exist uncertainty in the messages? How could we propagate such uncertainty over a graph together with messages? To address these issues, we propose a Bayesian uncertainty propagation (BUP) method, which embeds GNNs in a Bayesian modeling framework, and models predictive uncertainty of node classification with Bayesian confidence of predictive probability and uncertainty of messages. Our method proposes a novel uncertainty propagation mechanism inspired by Gaussian models. Moreover, we present an uncertainty oriented loss for node classification that allows the GNNs to clearly integrate predictive uncertainty in learning procedure. Consequently, the training examples with large predictive uncertainty will be penalized. We demonstrate the BUP with respect to prediction reliability and out-of-distribution (OOD) predictions. The learned uncertainty is also analyzed in depth. The relations between uncertainty and graph topology, as well as predictive uncertainty in the OOD cases are investigated with extensive experiments. The empirical results with popular benchmark datasets demonstrate the superior performance of the proposed method. △ Less

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2212.05178 [pdf, ps, other]

State-Regularized Recurrent Neural Networks to Extract Automata and Explain Predictions

Authors: Cheng Wang, Carolin Lawrence, Mathias Niepert

Abstract: Recurrent neural networks are a widely used class of neural architectures. They have, however, two shortcomings. First, they are often treated as black-box models and as such it is difficult to understand what exactly they learn as well as how they arrive at a particular prediction. Second, they tend to work poorly on sequences requiring long-term memorization, despite having this capacity in prin… ▽ More Recurrent neural networks are a widely used class of neural architectures. They have, however, two shortcomings. First, they are often treated as black-box models and as such it is difficult to understand what exactly they learn as well as how they arrive at a particular prediction. Second, they tend to work poorly on sequences requiring long-term memorization, despite having this capacity in principle. We aim to address both shortcomings with a class of recurrent networks that use a stochastic state transition mechanism between cell applications. This mechanism, which we term state-regularization, makes RNNs transition between a finite set of learnable states. We evaluate state-regularized RNNs on (1) regular languages for the purpose of automata extraction; (2) non-regular languages such as balanced parentheses and palindromes where external memory is required; and (3) real-word sequence learning tasks for sentiment analysis, visual object recognition and text categorisation. We show that state-regularization (a) simplifies the extraction of finite state automata that display an RNN's state transition dynamic; (b) forces RNNs to operate more like automata with external memory and less like finite state machines, which potentiality leads to a more structural memory; (c) leads to better interpretability and explainability of RNNs by leveraging the probabilistic finite state transition mechanism over time steps. △ Less

Submitted 9 December, 2022; originally announced December 2022.

Comments: To appear at IEEE Transactions on Pattern Analysis and Machine Intelligence. The extended version of State-Regularized Recurrent Neural Networks [arXiv:1901.08817]

arXiv:2212.00424 [pdf, other]

Multi-Source Survival Domain Adaptation

Authors: Ammar Shaker, Carolin Lawrence

Abstract: Survival analysis is the branch of statistics that studies the relation between the characteristics of living entities and their respective survival times, taking into account the partial information held by censored cases. A good analysis can, for example, determine whether one medical treatment for a group of patients is better than another. With the rise of machine learning, survival analysis c… ▽ More Survival analysis is the branch of statistics that studies the relation between the characteristics of living entities and their respective survival times, taking into account the partial information held by censored cases. A good analysis can, for example, determine whether one medical treatment for a group of patients is better than another. With the rise of machine learning, survival analysis can be modeled as learning a function that maps studied patients to their survival times. To succeed with that, there are three crucial issues to be tackled. First, some patient data is censored: we do not know the true survival times for all patients. Second, data is scarce, which led past research to treat different illness types as domains in a multi-task setup. Third, there is the need for adaptation to new or extremely rare illness types, where little or no labels are available. In contrast to previous multi-task setups, we want to investigate how to efficiently adapt to a new survival target domain from multiple survival source domains. For this, we introduce a new survival metric and the corresponding discrepancy measure between survival distributions. These allow us to define domain adaptation for survival analysis while incorporating censored data, which would otherwise have to be dropped. Our experiments on two cancer data sets reveal a superb performance on target domains, a better treatment recommendation, and a weight matrix with a plausible explanation. △ Less

Submitted 6 March, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

Comments: 37th AAAI Conference on Artificial Intelligence, 2023. Includes Appendix

arXiv:2208.11024 [pdf, other]

KGxBoard: Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models

Authors: Haris Widjaja, Kiril Gashteovski, Wiem Ben Rim, Pengfei Liu, Christopher Malon, Daniel Ruffinelli, Carolin Lawrence, Graham Neubig

Abstract: Knowledge Graphs (KGs) store information in the form of (head, predicate, tail)-triples. To augment KGs with new knowledge, researchers proposed models for KG Completion (KGC) tasks such as link prediction; i.e., answering (h; p; ?) or (?; p; t) queries. Such models are usually evaluated with averaged metrics on a held-out test set. While useful for tracking progress, averaged single-score metrics… ▽ More Knowledge Graphs (KGs) store information in the form of (head, predicate, tail)-triples. To augment KGs with new knowledge, researchers proposed models for KG Completion (KGC) tasks such as link prediction; i.e., answering (h; p; ?) or (?; p; t) queries. Such models are usually evaluated with averaged metrics on a held-out test set. While useful for tracking progress, averaged single-score metrics cannot reveal what exactly a model has learned -- or failed to learn. To address this issue, we propose KGxBoard: an interactive framework for performing fine-grained evaluation on meaningful subsets of the data, each of which tests individual and interpretable capabilities of a KGC model. In our experiments, we highlight the findings that we discovered with the use of KGxBoard, which would have been impossible to detect with standard averaged single-score metrics. △ Less

Submitted 23 August, 2022; originally announced August 2022.

arXiv:2207.04447 [pdf, other]

Human-Centric Research for NLP: Towards a Definition and Guiding Questions

Authors: Bhushan Kotnis, Kiril Gashteovski, Julia Gastinger, Giuseppe Serra, Francesco Alesiani, Timo Sztyler, Ammar Shaker, Na Gong, Carolin Lawrence, Zhao Xu

Abstract: With Human-Centric Research (HCR) we can steer research activities so that the research outcome is beneficial for human stakeholders, such as end users. But what exactly makes research human-centric? We address this question by providing a working definition and define how a research pipeline can be split into different stages in which human-centric components can be added. Additionally, we discus… ▽ More With Human-Centric Research (HCR) we can steer research activities so that the research outcome is beneficial for human stakeholders, such as end users. But what exactly makes research human-centric? We address this question by providing a working definition and define how a research pipeline can be split into different stages in which human-centric components can be added. Additionally, we discuss existing NLP with HCR components and define a series of guiding questions, which can serve as starting points for researchers interested in exploring human-centric research approaches. We hope that this work would inspire researchers to refine the proposed definition and to pose other questions that might be meaningful for achieving HCR. △ Less

Submitted 10 July, 2022; originally announced July 2022.

arXiv:2205.12749 [pdf, other]

A Human-Centric Assessment Framework for AI

Authors: Sascha Saralajew, Ammar Shaker, Zhao Xu, Kiril Gashteovski, Bhushan Kotnis, Wiem Ben Rim, Jürgen Quittek, Carolin Lawrence

Abstract: With the rise of AI systems in real-world applications comes the need for reliable and trustworthy AI. An essential aspect of this are explainable AI systems. However, there is no agreed standard on how explainable AI systems should be assessed. Inspired by the Turing test, we introduce a human-centric assessment framework where a leading domain expert accepts or rejects the solutions of an AI sys… ▽ More With the rise of AI systems in real-world applications comes the need for reliable and trustworthy AI. An essential aspect of this are explainable AI systems. However, there is no agreed standard on how explainable AI systems should be assessed. Inspired by the Turing test, we introduce a human-centric assessment framework where a leading domain expert accepts or rejects the solutions of an AI system and another domain expert. By comparing the acceptance rates of provided solutions, we can assess how the AI system performs compared to the domain expert, and whether the AI system's explanations (if provided) are human-understandable. This setup -- comparable to the Turing test -- can serve as a framework for a wide range of human-centric AI system assessments. We demonstrate this by presenting two instantiations: (1) an assessment that measures the classification accuracy of a system with the option to incorporate label uncertainties; (2) an assessment where the usefulness of provided explanations is determined in a human-centric manner. △ Less

Submitted 1 July, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: Accepted as submission to ICML 2022 Workshop on Human-Machine Collaboration and Teaming

arXiv:2203.05121 [pdf, other]

Collusion Detection in Team-Based Multiplayer Games

Authors: Laura Greige, Fernando De Mesentier Silva, Meredith Trotter, Chris Lawrence, Peter Chin, Dilip Varadarajan

Abstract: In the context of competitive multiplayer games, collusion happens when two or more teams decide to collaborate towards a common goal, with the intention of gaining an unfair advantage from this cooperation. The task of identifying colluders from the player population is however infeasible to game designers due to the sheer size of the player population. In this paper, we propose a system that det… ▽ More In the context of competitive multiplayer games, collusion happens when two or more teams decide to collaborate towards a common goal, with the intention of gaining an unfair advantage from this cooperation. The task of identifying colluders from the player population is however infeasible to game designers due to the sheer size of the player population. In this paper, we propose a system that detects colluding behaviors in team-based multiplayer games and highlights the players that most likely exhibit colluding behaviors. The game designers then proceed to analyze a smaller subset of players and decide what action to take. For this reason, it is important and necessary to be extremely careful with false positives when automating the detection. The proposed method analyzes the players' social relationships paired with their in-game behavioral patterns and, using tools from graph theory, infers a feature set that allows us to detect and measure the degree of collusion exhibited by each pair of players from opposing teams. We then automate the detection using Isolation Forest, an unsupervised learning technique specialized in highlighting outliers, and show the performance and efficiency of our approach on two real datasets, each with over 170,000 unique players and over 100,000 different matches. △ Less

Submitted 9 March, 2022; originally announced March 2022.

Comments: 14 pages, 4 figures

arXiv:2110.08144 [pdf, other]

milIE: Modular & Iterative Multilingual Open Information Extraction

Authors: Bhushan Kotnis, Kiril Gashteovski, Daniel Oñoro Rubio, Vanesa Rodriguez-Tembras, Ammar Shaker, Makoto Takamoto, Mathias Niepert, Carolin Lawrence

Abstract: Open Information Extraction (OpenIE) is the task of extracting (subject, predicate, object) triples from natural language sentences. Current OpenIE systems extract all triple slots independently. In contrast, we explore the hypothesis that it may be beneficial to extract triple slots iteratively: first extract easy slots, followed by the difficult ones by conditioning on the easy slots, and theref… ▽ More Open Information Extraction (OpenIE) is the task of extracting (subject, predicate, object) triples from natural language sentences. Current OpenIE systems extract all triple slots independently. In contrast, we explore the hypothesis that it may be beneficial to extract triple slots iteratively: first extract easy slots, followed by the difficult ones by conditioning on the easy slots, and therefore achieve a better overall extraction. Based on this hypothesis, we propose a neural OpenIE system, milIE, that operates in an iterative fashion. Due to the iterative nature, the system is also modular -- it is possible to seamlessly integrate rule based extraction systems with a neural end-to-end system, thereby allowing rule based systems to supply extraction slots which milIE can leverage for extracting the remaining slots. We confirm our hypothesis empirically: milIE outperforms SOTA systems on multiple languages ranging from Chinese to Arabic. Additionally, we are the first to provide an OpenIE test dataset for Arabic and Galician. △ Less

Submitted 25 April, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

arXiv:2109.07464 [pdf, other]

AnnIE: An Annotation Platform for Constructing Complete Open Information Extraction Benchmark

Authors: Niklas Friedrich, Kiril Gashteovski, Mingying Yu, Bhushan Kotnis, Carolin Lawrence, Mathias Niepert, Goran Glavaš

Abstract: Open Information Extraction (OIE) is the task of extracting facts from sentences in the form of relations and their corresponding arguments in schema-free manner. Intrinsic performance of OIE systems is difficult to measure due to the incompleteness of existing OIE benchmarks: the ground truth extractions do not group all acceptable surface realizations of the same fact that can be extracted from… ▽ More Open Information Extraction (OIE) is the task of extracting facts from sentences in the form of relations and their corresponding arguments in schema-free manner. Intrinsic performance of OIE systems is difficult to measure due to the incompleteness of existing OIE benchmarks: the ground truth extractions do not group all acceptable surface realizations of the same fact that can be extracted from a sentence. To measure performance of OIE systems more realistically, it is necessary to manually annotate complete facts (i.e., clusters of all acceptable surface realizations of the same fact) from input sentences. We propose AnnIE: an interactive annotation platform that facilitates such challenging annotation tasks and supports creation of complete fact-oriented OIE evaluation benchmarks. AnnIE is modular and flexible in order to support different use case scenarios (i.e., benchmarks covering different types of facts). We use AnnIE to build two complete OIE benchmarks: one with verb-mediated facts and another with facts encompassing named entities. Finally, we evaluate several OIE systems on our complete benchmarks created with AnnIE. Our results suggest that existing incomplete benchmarks are overly lenient, and that OIE systems are not as robust as previously reported. We publicly release AnnIE under non-restrictive license. △ Less

Submitted 13 April, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

arXiv:2109.06850 [pdf, other]

BenchIE: A Framework for Multi-Faceted Fact-Based Open Information Extraction Evaluation

Authors: Kiril Gashteovski, Mingying Yu, Bhushan Kotnis, Carolin Lawrence, Mathias Niepert, Goran Glavaš

Abstract: Intrinsic evaluations of OIE systems are carried out either manually -- with human evaluators judging the correctness of extractions -- or automatically, on standardized benchmarks. The latter, while much more cost-effective, is less reliable, primarily because of the incompleteness of the existing OIE benchmarks: the ground truth extractions do not include all acceptable variants of the same fact… ▽ More Intrinsic evaluations of OIE systems are carried out either manually -- with human evaluators judging the correctness of extractions -- or automatically, on standardized benchmarks. The latter, while much more cost-effective, is less reliable, primarily because of the incompleteness of the existing OIE benchmarks: the ground truth extractions do not include all acceptable variants of the same fact, leading to unreliable assessment of the models' performance. Moreover, the existing OIE benchmarks are available for English only. In this work, we introduce BenchIE: a benchmark and evaluation framework for comprehensive evaluation of OIE systems for English, Chinese, and German. In contrast to existing OIE benchmarks, BenchIE is fact-based, i.e., it takes into account informational equivalence of extractions: our gold standard consists of fact synsets, clusters in which we exhaustively list all acceptable surface forms of the same fact. Moreover, having in mind common downstream applications for OIE, we make BenchIE multi-faceted; i.e., we create benchmark variants that focus on different facets of OIE evaluation, e.g., compactness or minimality of extractions. We benchmark several state-of-the-art OIE systems using BenchIE and demonstrate that these systems are significantly less effective than indicated by existing OIE benchmarks. We make BenchIE (data and evaluation code) publicly available on https://github.com/gkiril/benchie. △ Less

Submitted 13 April, 2022; v1 submitted 14 September, 2021; originally announced September 2021.

arXiv:2106.13642 [pdf, other]

VEGN: Variant Effect Prediction with Graph Neural Networks

Authors: Jun Cheng, Carolin Lawrence, Mathias Niepert

Abstract: Genetic mutations can cause disease by disrupting normal gene function. Identifying the disease-causing mutations from millions of genetic variants within an individual patient is a challenging problem. Computational methods which can prioritize disease-causing mutations have, therefore, enormous applications. It is well-known that genes function through a complex regulatory network. However, exis… ▽ More Genetic mutations can cause disease by disrupting normal gene function. Identifying the disease-causing mutations from millions of genetic variants within an individual patient is a challenging problem. Computational methods which can prioritize disease-causing mutations have, therefore, enormous applications. It is well-known that genes function through a complex regulatory network. However, existing variant effect prediction models only consider a variant in isolation. In contrast, we propose VEGN, which models variant effect prediction using a graph neural network (GNN) that operates on a heterogeneous graph with genes and variants. The graph is created by assigning variants to genes and connecting genes with an gene-gene interaction network. In this context, we explore an approach where a gene-gene graph is given and another where VEGN learns the gene-gene graph and therefore operates both on given and learnt edges. The graph neural network is trained to aggregate information between genes, and between genes and variants. Variants can exchange information via the genes they connect to. This approach improves the performance of existing state-of-the-art models. △ Less

Submitted 25 June, 2021; originally announced June 2021.

Comments: Accepted at Workshop on Computational Biology, co-located with the 38th International Conference on Machine Learning

arXiv:2102.10648 [pdf, ps, other]

Dimensions of Timescales in Neuromorphic Computing Systems

Authors: Herbert Jaeger, Dirk Doorakkers, Celestine Lawrence, Giacomo Indiveri

Abstract: This article is a public deliverable of the EU project "Memory technologies with multi-scale time constants for neuromorphic architectures" (MeMScales, https://memscales.eu, Call ICT-06-2019 Unconventional Nanoelectronics, project number 871371). This arXiv version is a verbatim copy of the deliverable report, with administrative information stripped. It collects a wide and varied assortment of ph… ▽ More This article is a public deliverable of the EU project "Memory technologies with multi-scale time constants for neuromorphic architectures" (MeMScales, https://memscales.eu, Call ICT-06-2019 Unconventional Nanoelectronics, project number 871371). This arXiv version is a verbatim copy of the deliverable report, with administrative information stripped. It collects a wide and varied assortment of phenomena, models, research themes and algorithmic techniques that are connected with timescale phenomena in the fields of computational neuroscience, mathematics, machine learning and computer science, with a bias toward aspects that are relevant for neuromorphic engineering. It turns out that this theme is very rich indeed and spreads out in many directions which defy a unified treatment. We collected several dozens of sub-themes, each of which has been investigated in specialized settings (in the neurosciences, mathematics, computer science and machine learning) and has been documented in its own body of literature. The more we dived into this diversity, the more it became clear that our first effort to compose a survey must remain sketchy and partial. We conclude with a list of insights distilled from this survey which give general guidelines for the design of future neuromorphic systems. △ Less

Submitted 21 February, 2021; originally announced February 2021.

arXiv:2011.12010 [pdf, other]

Uncertainty Estimation and Calibration with Finite-State Probabilistic RNNs

Authors: Cheng Wang, Carolin Lawrence, Mathias Niepert

Abstract: Uncertainty quantification is crucial for building reliable and trustable machine learning systems. We propose to estimate uncertainty in recurrent neural networks (RNNs) via stochastic discrete state transitions over recurrent timesteps. The uncertainty of the model can be quantified by running a prediction several times, each time sampling from the recurrent state transition distribution, leadin… ▽ More Uncertainty quantification is crucial for building reliable and trustable machine learning systems. We propose to estimate uncertainty in recurrent neural networks (RNNs) via stochastic discrete state transitions over recurrent timesteps. The uncertainty of the model can be quantified by running a prediction several times, each time sampling from the recurrent state transition distribution, leading to potentially different results if the model is uncertain. Alongside uncertainty quantification, our proposed method offers several advantages in different settings. The proposed method can (1) learn deterministic and probabilistic automata from data, (2) learn well-calibrated models on real-world classification tasks, (3) improve the performance of out-of-distribution detection, and (4) control the exploration-exploitation trade-off in reinforcement learning. △ Less

Submitted 24 November, 2020; originally announced November 2020.

arXiv:2011.02511 [pdf, ps, other]

Offline Reinforcement Learning from Human Feedback in Real-World Sequence-to-Sequence Tasks

Authors: Julia Kreutzer, Stefan Riezler, Carolin Lawrence

Abstract: Large volumes of interaction logs can be collected from NLP systems that are deployed in the real world. How can this wealth of information be leveraged? Using such interaction logs in an offline reinforcement learning (RL) setting is a promising approach. However, due to the nature of NLP tasks and the constraints of production systems, a series of challenges arise. We present a concise overview… ▽ More Large volumes of interaction logs can be collected from NLP systems that are deployed in the real world. How can this wealth of information be leveraged? Using such interaction logs in an offline reinforcement learning (RL) setting is a promising approach. However, due to the nature of NLP tasks and the constraints of production systems, a series of challenges arise. We present a concise overview of these challenges and discuss possible solutions. △ Less

Submitted 9 June, 2021; v1 submitted 4 November, 2020; originally announced November 2020.

Comments: 5th Workshop on Structured Prediction for NLP at ACL 2021 Previously named "Learning from Human Feedback: Challenges for Real-World Reinforcement Learning in NLP" and presented at Challenges of Real-World RL Workshop at NeurIPS 2020

arXiv:2010.05516 [pdf, other]

Explaining Neural Matrix Factorization with Gradient Rollback

Authors: Carolin Lawrence, Timo Sztyler, Mathias Niepert

Abstract: Explaining the predictions of neural black-box models is an important problem, especially when such models are used in applications where user trust is crucial. Estimating the influence of training examples on a learned neural model's behavior allows us to identify training examples most responsible for a given prediction and, therefore, to faithfully explain the output of a black-box model. The m… ▽ More Explaining the predictions of neural black-box models is an important problem, especially when such models are used in applications where user trust is crucial. Estimating the influence of training examples on a learned neural model's behavior allows us to identify training examples most responsible for a given prediction and, therefore, to faithfully explain the output of a black-box model. The most generally applicable existing method is based on influence functions, which scale poorly for larger sample sizes and models. We propose gradient rollback, a general approach for influence estimation, applicable to neural models where each parameter update step during gradient descent touches a smaller number of parameters, even if the overall number of parameters is large. Neural matrix factorization models trained with gradient descent are part of this model class. These models are popular and have found a wide range of applications in industry. Especially knowledge graph embedding methods, which belong to this class, are used extensively. We show that gradient rollback is highly efficient at both training and test time. Moreover, we show theoretically that the difference between gradient rollback's influence approximation and the true influence on a model's behavior is smaller than known bounds on the stability of stochastic gradient descent. This establishes that gradient rollback is robustly estimating example influence. We also conduct experiments which show that gradient rollback provides faithful explanations for knowledge base completion and recommender datasets. △ Less

Submitted 15 December, 2020; v1 submitted 12 October, 2020; originally announced October 2020.

Comments: 35th AAAI Conference on Artificial Intelligence, 2021. Includes Appendix

arXiv:2004.02596 [pdf, other]

Answering Complex Queries in Knowledge Graphs with Bidirectional Sequence Encoders

Authors: Bhushan Kotnis, Carolin Lawrence, Mathias Niepert

Abstract: Representation learning for knowledge graphs (KGs) has focused on the problem of answering simple link prediction queries. In this work we address the more ambitious challenge of predicting the answers of conjunctive queries with multiple missing entities. We propose Bi-Directional Query Embedding (BIQE), a method that embeds conjunctive queries with models based on bi-directional attention mechan… ▽ More Representation learning for knowledge graphs (KGs) has focused on the problem of answering simple link prediction queries. In this work we address the more ambitious challenge of predicting the answers of conjunctive queries with multiple missing entities. We propose Bi-Directional Query Embedding (BIQE), a method that embeds conjunctive queries with models based on bi-directional attention mechanisms. Contrary to prior work, bidirectional self-attention can capture interactions among all the elements of a query graph. We introduce a new dataset for predicting the answer of conjunctive query and conduct experiments that show BIQE significantly outperforming state of the art baselines. △ Less

Submitted 4 February, 2021; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: 8 pages, 2 figures

arXiv:1908.08737 [pdf, other]

Design choices for productive, secure, data-intensive research at scale in the cloud

Authors: Diego Arenas, Jon Atkins, Claire Austin, David Beavan, Alvaro Cabrejas Egea, Steven Carlysle-Davies, Ian Carter, Rob Clarke, James Cunningham, Tom Doel, Oliver Forrest, Evelina Gabasova, James Geddes, James Hetherington, Radka Jersakova, Franz Kiraly, Catherine Lawrence, Jules Manser, Martin T. O'Reilly, James Robinson, Helen Sherwood-Taylor, Serena Tierney, Catalina A. Vallejos, Sebastian Vollmer, Kirstie Whitaker

Abstract: We present a policy and process framework for secure environments for productive data science research projects at scale, by combining prevailing data security threat and risk profiles into five sensitivity tiers, and, at each tier, specifying recommended policies for data classification, data ingress, software ingress, data egress, user access, user device control, and analysis environments. By p… ▽ More We present a policy and process framework for secure environments for productive data science research projects at scale, by combining prevailing data security threat and risk profiles into five sensitivity tiers, and, at each tier, specifying recommended policies for data classification, data ingress, software ingress, data egress, user access, user device control, and analysis environments. By presenting design patterns for security choices for each tier, and using software defined infrastructure so that a different, independent, secure research environment can be instantiated for each project appropriate to its classification, we hope to maximise researcher productivity and minimise risk, allowing research organisations to operate with confidence. △ Less

Submitted 15 September, 2019; v1 submitted 23 August, 2019; originally announced August 2019.

arXiv:1908.05915 [pdf, other]

Attending to Future Tokens For Bidirectional Sequence Generation

Authors: Carolin Lawrence, Bhushan Kotnis, Mathias Niepert

Abstract: Neural sequence generation is typically performed token-by-token and left-to-right. Whenever a token is generated only previously produced tokens are taken into consideration. In contrast, for problems such as sequence classification, bidirectional attention, which takes both past and future tokens into consideration, has been shown to perform much better. We propose to make the sequence generatio… ▽ More Neural sequence generation is typically performed token-by-token and left-to-right. Whenever a token is generated only previously produced tokens are taken into consideration. In contrast, for problems such as sequence classification, bidirectional attention, which takes both past and future tokens into consideration, has been shown to perform much better. We propose to make the sequence generation process bidirectional by employing special placeholder tokens. Treated as a node in a fully connected graph, a placeholder token can take past and future tokens into consideration when generating the actual output token. We verify the effectiveness of our approach experimentally on two conversational tasks where the proposed bidirectional model outperforms competitive baselines by a large margin. △ Less

Submitted 17 September, 2019; v1 submitted 16 August, 2019; originally announced August 2019.

Comments: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019, Hong Kong, China

arXiv:1907.08738 [pdf, other]

Bayesian Inference Gaussian Process Multiproxy Alignment of Continuous Signals (BIGMACS): Applications for Paleoceanography

Authors: Taehee Lee, Lorraine E. Lisiecki, Devin Rand, Geoffrey Gebbie, Charles E. Lawrence

Abstract: We first introduce a novel profile-based alignment algorithm, the multiple continuous Signal Alignment algorithm with Gaussian Process Regression profiles (SA-GPR). SA-GPR addresses the limitations of currently available signal alignment methods by adopting a hybrid of the particle smoothing and Markov-chain Monte Carlo (MCMC) algorithms to align signals, and by applying the Gaussian process regre… ▽ More We first introduce a novel profile-based alignment algorithm, the multiple continuous Signal Alignment algorithm with Gaussian Process Regression profiles (SA-GPR). SA-GPR addresses the limitations of currently available signal alignment methods by adopting a hybrid of the particle smoothing and Markov-chain Monte Carlo (MCMC) algorithms to align signals, and by applying the Gaussian process regression to construct profiles to be aligned continuously. SA-GPR shares all the strengths of the existing alignment algorithms that depend on profiles but is more exact in the sense that profiles do not need to be discretized as sequential bins. The uncertainty of performance over the resolution of such bins is thereby eliminated. This methodology produces alignments that are consistent, that regularize extreme cases, and that properly reflect the inherent uncertainty. Then we extend SA-GPR to a specific problem in the field of paleoceanography with a method called Bayesian Inference Gaussian Process Multiproxy Alignment of Continuous Signals (BIGMACS). The goal of BIGMACS is to infer continuous ages for ocean sediment cores using two classes of age proxies: proxies that explicitly return calendar ages (e.g., radiocarbon) and those used to synchronize ages in multiple marine records (e.g., an oxygen isotope based marine proxy known as benthic $δ^{18}{\rm O}$). BIGMACS integrates these two proxies by iteratively performing two steps: profile construction from benthic $δ^{18}{\rm O}$ age models and alignment of each core to the profile also reflecting radiocarbon dates. We use BIGMACS to construct a new Deep Northeastern Atlantic stack (i.e., a profile from a particular benthic $δ^{18}{\rm O}$ records) of five ocean sediment cores. We conclude by constructing multiproxy age models for two additional cores from the same region by aligning them to the stack. △ Less

Submitted 13 June, 2021; v1 submitted 19 July, 2019; originally announced July 2019.

Comments: This article has been submitted to "Bayesian Analysis"

arXiv:1907.03748 [pdf, other]

Learning Neural Sequence-to-Sequence Models from Weak Feedback with Bipolar Ramp Loss

Authors: Laura Jehl, Carolin Lawrence, Stefan Riezler

Abstract: In many machine learning scenarios, supervision by gold labels is not available and consequently neural models cannot be trained directly by maximum likelihood estimation (MLE). In a weak supervision scenario, metric-augmented objectives can be employed to assign feedback to model outputs, which can be used to extract a supervision signal for training. We present several objectives for two separat… ▽ More In many machine learning scenarios, supervision by gold labels is not available and consequently neural models cannot be trained directly by maximum likelihood estimation (MLE). In a weak supervision scenario, metric-augmented objectives can be employed to assign feedback to model outputs, which can be used to extract a supervision signal for training. We present several objectives for two separate weakly supervised tasks, machine translation and semantic parsing. We show that objectives should actively discourage negative outputs in addition to promoting a surrogate gold structure. This notion of bipolarity is naturally present in ramp loss objectives, which we adapt to neural models. We show that bipolar ramp loss objectives outperform other non-bipolar ramp loss objectives and minimum risk training (MRT) on both weakly supervised tasks, as well as on a supervised machine translation task. Additionally, we introduce a novel token-level ramp loss objective, which is able to outperform even the best sequence-level ramp loss on both weakly supervised tasks. △ Less

Submitted 6 July, 2019; originally announced July 2019.

Comments: Transactions of the Association for Computational Linguistics 2019 Vol. 7, 233-248. Presented at ACL, Florence, Italy

arXiv:1811.12239 [pdf, other]

Counterfactual Learning from Human Proofreading Feedback for Semantic Parsing

Authors: Carolin Lawrence, Stefan Riezler

Abstract: In semantic parsing for question-answering, it is often too expensive to collect gold parses or even gold answers as supervision signals. We propose to convert model outputs into a set of human-understandable statements which allow non-expert users to act as proofreaders, providing error markings as learning signals to the parser. Because model outputs were suggested by a historic system, we opera… ▽ More In semantic parsing for question-answering, it is often too expensive to collect gold parses or even gold answers as supervision signals. We propose to convert model outputs into a set of human-understandable statements which allow non-expert users to act as proofreaders, providing error markings as learning signals to the parser. Because model outputs were suggested by a historic system, we operate in a counterfactual, or off-policy, learning setup. We introduce new estimators which can effectively leverage the given feedback and which avoid known degeneracies in counterfactual learning, while still being applicable to stochastic gradient optimization for neural semantic parsing. Furthermore, we discuss how our feedback collection method can be seamlessly integrated into deployed virtual personal assistants that embed a semantic parser. Our work is the first to show that semantic parsers can be improved significantly by counterfactual learning from logged human feedback data. △ Less

Submitted 29 November, 2018; originally announced November 2018.

Comments: "Learning by Instruction" Workshop at the 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada. arXiv admin note: substantial text overlap with arXiv:1805.01252

arXiv:1805.01252 [pdf, other]

Improving a Neural Semantic Parser by Counterfactual Learning from Human Bandit Feedback

Authors: Carolin Lawrence, Stefan Riezler

Abstract: Counterfactual learning from human bandit feedback describes a scenario where user feedback on the quality of outputs of a historic system is logged and used to improve a target system. We show how to apply this learning framework to neural semantic parsing. From a machine learning perspective, the key challenge lies in a proper reweighting of the estimator so as to avoid known degeneracies in cou… ▽ More Counterfactual learning from human bandit feedback describes a scenario where user feedback on the quality of outputs of a historic system is logged and used to improve a target system. We show how to apply this learning framework to neural semantic parsing. From a machine learning perspective, the key challenge lies in a proper reweighting of the estimator so as to avoid known degeneracies in counterfactual learning, while still being applicable to stochastic gradient optimization. To conduct experiments with human users, we devise an easy-to-use interface to collect human feedback on semantic parses. Our work is the first to show that semantic parsers can be improved significantly by counterfactual learning from logged human feedback data. △ Less

Submitted 30 November, 2018; v1 submitted 3 May, 2018; originally announced May 2018.

Comments: Conference of the Association for Computational Linguistics (ACL), 2018, Melbourne, Australia

arXiv:1711.08621 [pdf, ps, other]

Counterfactual Learning for Machine Translation: Degeneracies and Solutions

Authors: Carolin Lawrence, Pratik Gajane, Stefan Riezler

Abstract: Counterfactual learning is a natural scenario to improve web-based machine translation services by offline learning from feedback logged during user interactions. In order to avoid the risk of showing inferior translations to users, in such scenarios mostly exploration-free deterministic logging policies are in place. We analyze possible degeneracies of inverse and reweighted propensity scoring es… ▽ More Counterfactual learning is a natural scenario to improve web-based machine translation services by offline learning from feedback logged during user interactions. In order to avoid the risk of showing inferior translations to users, in such scenarios mostly exploration-free deterministic logging policies are in place. We analyze possible degeneracies of inverse and reweighted propensity scoring estimators, in stochastic and deterministic settings, and relate them to recently proposed techniques for counterfactual learning under deterministic logging. △ Less

Submitted 14 December, 2017; v1 submitted 23 November, 2017; originally announced November 2017.

Comments: Workshop "From 'What If?' To 'What Next?'" at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA

arXiv:1707.09118 [pdf, other]

Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation

Authors: Carolin Lawrence, Artem Sokolov, Stefan Riezler

Abstract: The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another, historic SMT system. A challenge arises by the fact that risk-averse commercial SMT systems deterministically log the most probable translation. The lack of sufficient exploration of the SMT o… ▽ More The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another, historic SMT system. A challenge arises by the fact that risk-averse commercial SMT systems deterministically log the most probable translation. The lack of sufficient exploration of the SMT output space seemingly contradicts the theoretical requirements for counterfactual learning. We show that counterfactual learning from deterministic bandit logs is possible nevertheless by smoothing out deterministic components in learning. This can be achieved by additive and multiplicative control variates that avoid degenerate behavior in empirical risk minimization. Our simulation experiments show improvements of up to 2 BLEU points by counterfactual learning from deterministic bandit feedback. △ Less

Submitted 14 December, 2017; v1 submitted 28 July, 2017; originally announced July 2017.

Comments: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017, Copenhagen, Denmark

Showing 1–29 of 29 results for author: Lawrence, C