Search | arXiv e-print repository

Interpretability of Language Models via Task Spaces

Authors: Lucas Weber, Jaap Jumelet, Elia Bruni, Dieuwke Hupkes

Abstract: The usual way to interpret language models (LMs) is to test their performance on different benchmarks and subsequently infer their internal processes. In this paper, we present an alternative approach, concentrating on the quality of LM processing, with a focus on their language abilities. To this end, we construct 'linguistic task spaces' -- representations of an LM's language conceptualisation -… ▽ More The usual way to interpret language models (LMs) is to test their performance on different benchmarks and subsequently infer their internal processes. In this paper, we present an alternative approach, concentrating on the quality of LM processing, with a focus on their language abilities. To this end, we construct 'linguistic task spaces' -- representations of an LM's language conceptualisation -- that shed light on the connections LMs draw between language phenomena. Task spaces are based on the interactions of the learning signals from different linguistic phenomena, which we assess via a method we call 'similarity probing'. To disentangle the learning signals of linguistic phenomena, we further introduce a method called 'fine-tuning via gradient differentials' (FTGD). We apply our methods to language models of three different scales and find that larger models generalise better to overarching general concepts for linguistic tasks, making better use of their shared structure. Further, the distributedness of linguistic processing increases with pre-training through increased parameter sharing between related linguistic tasks. The overall generalisation patterns are mostly stable throughout training and not marked by incisive stages, potentially explaining the lack of successful curriculum strategies for LMs. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: To be published at ACL 2024 (main)

arXiv:2406.04766 [pdf, other]

Reinforcement Learning and Regret Bounds for Admission Control

Authors: Lucas Weber, Ana Bušić, Jiamin Zhu

Abstract: The expected regret of any reinforcement learning algorithm is lower bounded by $Ω\left(\sqrt{DXAT}\right)$ for undiscounted returns, where $D$ is the diameter of the Markov decision process, $X$ the size of the state space, $A$ the size of the action space and $T$ the number of time steps. However, this lower bound is general. A smaller regret can be obtained by taking into account some specific… ▽ More The expected regret of any reinforcement learning algorithm is lower bounded by $Ω\left(\sqrt{DXAT}\right)$ for undiscounted returns, where $D$ is the diameter of the Markov decision process, $X$ the size of the state space, $A$ the size of the action space and $T$ the number of time steps. However, this lower bound is general. A smaller regret can be obtained by taking into account some specific knowledge of the problem structure. In this article, we consider an admission control problem to an $M/M/c/S$ queue with $m$ job classes and class-dependent rewards and holding costs. Queuing systems often have a diameter that is exponential in the buffer size $S$, making the previous lower bound prohibitive for any practical use. We propose an algorithm inspired by UCRL2, and use the structure of the problem to upper bound the expected total regret by $O(S\log T + \sqrt{mT \log T})$ in the finite server case. In the infinite server case, we prove that the dependence of the regret on $S$ disappears. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.17202 [pdf, other]

Efficient multi-prompt evaluation of LLMs

Authors: Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin

Abstract: Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt va… ▽ More Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry. For example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. Our code and data can be found at https://github.com/felipemaiapolo/prompt-eval. △ Less

Submitted 7 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.02383 [pdf, other]

A Fresh Look at Sanity Checks for Saliency Maps

Authors: Anna Hedström, Leander Weber, Sebastian Lapuschkin, Marina Höhne

Abstract: The Model Parameter Randomisation Test (MPRT) is highly recognised in the eXplainable Artificial Intelligence (XAI) community due to its fundamental evaluative criterion: explanations should be sensitive to the parameters of the model they seek to explain. However, recent studies have raised several methodological concerns for the empirical interpretation of MPRT. In response, we propose two modif… ▽ More The Model Parameter Randomisation Test (MPRT) is highly recognised in the eXplainable Artificial Intelligence (XAI) community due to its fundamental evaluative criterion: explanations should be sensitive to the parameters of the model they seek to explain. However, recent studies have raised several methodological concerns for the empirical interpretation of MPRT. In response, we propose two modifications to the original test: Smooth MPRT and Efficient MPRT. The former reduces the impact of noise on evaluation outcomes via sampling, while the latter avoids the need for biased similarity measurements by re-interpreting the test through the increase in explanation complexity after full model randomisation. Our experiments show that these modifications enhance the metric reliability, facilitating a more trustworthy deployment of explanation methods. △ Less

Submitted 3 May, 2024; originally announced May 2024.

Comments: arXiv admin note: text overlap with arXiv:2401.06465

arXiv:2403.07137 [pdf, other]

Exploring Cluster Analysis in Nelore Cattle Visual Score Attribution

Authors: Alexandre de Oliveira Bezerra, Rodrigo Goncalves Mateus, Vanessa Ap. de Moraes Weber, Fabricio de Lima Weber, Yasmin Alves de Arruda, Rodrigo da Costa Gomes, Gabriel Toshio Hirokawa Higa, Hemerson Pistori

Abstract: Assessing the biotype of cattle through human visual inspection is a very common and important practice in precision cattle breeding. This paper presents the results of a correlation analysis between scores produced by humans for Nelore cattle and a variety of measurements that can be derived from images or other instruments. It also presents a study using the k-means algorithm to generate new way… ▽ More Assessing the biotype of cattle through human visual inspection is a very common and important practice in precision cattle breeding. This paper presents the results of a correlation analysis between scores produced by humans for Nelore cattle and a variety of measurements that can be derived from images or other instruments. It also presents a study using the k-means algorithm to generate new ways of clustering a batch of cattle using the measurements that most correlate with the animal's body weight and visual scores. △ Less

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2402.14992 [pdf, other]

tinyBenchmarks: evaluating LLMs with fewer examples

Authors: Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin

Abstract: The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very expensive. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. F… ▽ More The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very expensive. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. For example, we show that to accurately estimate the performance of an LLM on MMLU, a popular multiple-choice QA benchmark consisting of 14K examples, it is sufficient to evaluate this LLM on 100 curated examples. We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results. △ Less

Submitted 26 May, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: Proceedings of the 41st International Conference on Machine Learning (ICML)

arXiv:2401.06465 [pdf, other]

Sanity Checks Revisited: An Exploration to Repair the Model Parameter Randomisation Test

Authors: Anna Hedström, Leander Weber, Sebastian Lapuschkin, Marina MC Höhne

Abstract: The Model Parameter Randomisation Test (MPRT) is widely acknowledged in the eXplainable Artificial Intelligence (XAI) community for its well-motivated evaluative principle: that the explanation function should be sensitive to changes in the parameters of the model function. However, recent works have identified several methodological caveats for the empirical interpretation of MPRT. To address the… ▽ More The Model Parameter Randomisation Test (MPRT) is widely acknowledged in the eXplainable Artificial Intelligence (XAI) community for its well-motivated evaluative principle: that the explanation function should be sensitive to changes in the parameters of the model function. However, recent works have identified several methodological caveats for the empirical interpretation of MPRT. To address these caveats, we introduce two adaptations to the original MPRT -- Smooth MPRT and Efficient MPRT, where the former minimises the impact that noise has on the evaluation results through sampling and the latter circumvents the need for biased similarity measurements by re-interpreting the test through the explanation's rise in complexity, after full parameter randomisation. Our experimental results demonstrate that these proposed variants lead to improved metric reliability, thus enabling a more trustworthy application of XAI methods. △ Less

Submitted 12 January, 2024; originally announced January 2024.

Comments: 19 pages, 12 figures, NeurIPS XAIA 2023

arXiv:2312.04945 [pdf, other]

The ICL Consistency Test

Authors: Lucas Weber, Elia Bruni, Dieuwke Hupkes

Abstract: Just like the previous generation of task-tuned models, large language models (LLMs) that are adapted to tasks via prompt-based methods like in-context-learning (ICL) perform well in some setups but not in others. This lack of consistency in prompt-based learning hints at a lack of robust generalisation. We here introduce the ICL consistency test -- a contribution to the GenBench collaborative ben… ▽ More Just like the previous generation of task-tuned models, large language models (LLMs) that are adapted to tasks via prompt-based methods like in-context-learning (ICL) perform well in some setups but not in others. This lack of consistency in prompt-based learning hints at a lack of robust generalisation. We here introduce the ICL consistency test -- a contribution to the GenBench collaborative benchmark task (CBT) -- which evaluates how consistent a model makes predictions across many different setups while using the same data. The test is based on different established natural language inference tasks. We provide preprocessed data constituting 96 different 'setups' and a metric that estimates model consistency across these setups. The metric is provided on a fine-grained level to understand what properties of a setup render predictions unstable and on an aggregated level to compare overall model consistency. We conduct an empirical analysis of eight state-of-the-art models, and our consistency metric reveals how all tested LLMs lack robust generalisation. △ Less

Submitted 8 December, 2023; originally announced December 2023.

Comments: Accepted as non-archival submission to the GenBench Workshop 2023. arXiv admin note: substantial text overlap with arXiv:2310.13486

arXiv:2310.13486 [pdf, other]

Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning

Authors: Lucas Weber, Elia Bruni, Dieuwke Hupkes

Abstract: Finding the best way of adapting pre-trained language models to a task is a big challenge in current NLP. Just like the previous generation of task-tuned models (TT), models that are adapted to tasks via in-context-learning (ICL) are robust in some setups but not in others. Here, we present a detailed analysis of which design choices cause instabilities and inconsistencies in LLM predictions. Firs… ▽ More Finding the best way of adapting pre-trained language models to a task is a big challenge in current NLP. Just like the previous generation of task-tuned models (TT), models that are adapted to tasks via in-context-learning (ICL) are robust in some setups but not in others. Here, we present a detailed analysis of which design choices cause instabilities and inconsistencies in LLM predictions. First, we show how spurious correlations between input distributions and labels -- a known issue in TT models -- form only a minor problem for prompted models. Then, we engage in a systematic, holistic evaluation of different factors that have been found to influence predictions in a prompting setup. We test all possible combinations of a range of factors on both vanilla and instruction-tuned (IT) LLMs of different scale and statistically analyse the results to show which factors are the most influential, interactive or stable. Our results show which factors can be used without precautions and which should be avoided or handled with care in most settings. △ Less

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.05442 [pdf, other]

Establishing Trustworthiness: Rethinking Tasks and Model Evaluation

Authors: Robert Litschko, Max Müller-Eberstein, Rob van der Goot, Leon Weber, Barbara Plank

Abstract: Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has w… ▽ More Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has witnessed a dramatic shift towards general purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, followed by an increasing challenge for evaluation and analysis. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in NLP, and pursue a more holistic view on language, placing trustworthiness at the center. Towards this goal, we review existing compartmentalized approaches for understanding the origins of a model's functional capacity, and provide recommendations for more multi-faceted evaluation protocols. △ Less

Submitted 23 October, 2023; v1 submitted 9 October, 2023; originally announced October 2023.

Comments: Accepted at EMNLP 2023 (Main Conference), camera-ready

arXiv:2308.12202 [pdf, other]

Curriculum Learning with Adam: The Devil Is in the Wrong Details

Authors: Lucas Weber, Jaap Jumelet, Paul Michel, Elia Bruni, Dieuwke Hupkes

Abstract: Curriculum learning (CL) posits that machine learning models -- similar to humans -- may learn more efficiently from data that match their current learning progress. However, CL methods are still poorly understood and, in particular for natural language processing (NLP), have achieved only limited success. In this paper, we explore why. Starting from an attempt to replicate and extend a number of… ▽ More Curriculum learning (CL) posits that machine learning models -- similar to humans -- may learn more efficiently from data that match their current learning progress. However, CL methods are still poorly understood and, in particular for natural language processing (NLP), have achieved only limited success. In this paper, we explore why. Starting from an attempt to replicate and extend a number of recent curriculum methods, we find that their results are surprisingly brittle when applied to NLP. A deep dive into the (in)effectiveness of the curricula in some scenarios shows us why: when curricula are employed in combination with the popular Adam optimisation algorithm, they oftentimes learn to adapt to suboptimally chosen optimisation parameters for this algorithm. We present a number of different case studies with different common hand-crafted and automated CL approaches to illustrate this phenomenon, and we find that none of them outperforms optimisation with only Adam with well-chosen hyperparameters. As such, our results contribute to understanding why CL methods work, but at the same time urge caution when claiming positive results. △ Less

Submitted 23 August, 2023; originally announced August 2023.

arXiv:2308.12053 [pdf, other]

Layer-wise Feedback Propagation

Authors: Leander Weber, Jim Berend, Alexander Binder, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

Abstract: In this paper, we present Layer-wise Feedback Propagation (LFP), a novel training approach for neural-network-like predictors that utilizes explainability, specifically Layer-wise Relevance Propagation(LRP), to assign rewards to individual connections based on their respective contributions to solving a given task. This differs from traditional gradient descent, which updates parameters towards an… ▽ More In this paper, we present Layer-wise Feedback Propagation (LFP), a novel training approach for neural-network-like predictors that utilizes explainability, specifically Layer-wise Relevance Propagation(LRP), to assign rewards to individual connections based on their respective contributions to solving a given task. This differs from traditional gradient descent, which updates parameters towards anestimated loss minimum. LFP distributes a reward signal throughout the model without the need for gradient computations. It then strengthens structures that receive positive feedback while reducingthe influence of structures that receive negative feedback. We establish the convergence of LFP theoretically and empirically, and demonstrate its effectiveness in achieving comparable performance to gradient descent on various models and datasets. Notably, LFP overcomes certain limitations associated with gradient-based methods, such as reliance on meaningful derivatives. We further investigate how the different LRP-rules can be extended to LFP, what their effects are on training, as well as potential applications, such as training models with no meaningful derivatives, e.g., step-function activated Spiking Neural Networks (SNNs), or for transfer learning, to efficiently utilize existing knowledge. △ Less

Submitted 23 August, 2023; originally announced August 2023.

MSC Class: 68T05

arXiv:2305.20045 [pdf, other]

ActiveAED: A Human in the Loop Improves Annotation Error Detection

Authors: Leon Weber, Barbara Plank

Abstract: Manually annotated datasets are crucial for training and evaluating Natural Language Processing models. However, recent work has discovered that even widely-used benchmark datasets contain a substantial number of erroneous annotations. This problem has been addressed with Annotation Error Detection (AED) models, which can flag such errors for human re-annotation. However, even though many of these… ▽ More Manually annotated datasets are crucial for training and evaluating Natural Language Processing models. However, recent work has discovered that even widely-used benchmark datasets contain a substantial number of erroneous annotations. This problem has been addressed with Annotation Error Detection (AED) models, which can flag such errors for human re-annotation. However, even though many of these AED methods assume a final curation step in which a human annotator decides whether the annotation is erroneous, they have been developed as static models without any human-in-the-loop component. In this work, we propose ActiveAED, an AED method that can detect errors more accurately by repeatedly querying a human for error corrections in its prediction loop. We evaluate ActiveAED on eight datasets spanning five different tasks and find that it leads to improvements over the state of the art on seven of them, with gains of up to six percentage points in average precision. △ Less

Submitted 31 May, 2023; originally announced May 2023.

Comments: Findings of ACL 2023

arXiv:2305.10937 [pdf, ps, other]

The generalized Hierarchical Gaussian Filter

Authors: Lilian Aline Weber, Peter Thestrup Waade, Nicolas Legrand, Anna Hedvig Møller, Klaas Enno Stephan, Christoph Mathys

Abstract: Hierarchical Bayesian models of perception and learning feature prominently in contemporary cognitive neuroscience where, for example, they inform computational concepts of mental disorders. This includes predictive coding and hierarchical Gaussian filtering (HGF), which differ in the nature of hierarchical representations. Predictive coding assumes that higher levels in a given hierarchy influenc… ▽ More Hierarchical Bayesian models of perception and learning feature prominently in contemporary cognitive neuroscience where, for example, they inform computational concepts of mental disorders. This includes predictive coding and hierarchical Gaussian filtering (HGF), which differ in the nature of hierarchical representations. Predictive coding assumes that higher levels in a given hierarchy influence the state (value) of lower levels. In HGF, however, higher levels determine the rate of change at lower levels. Here, we extend the space of generative models underlying HGF to include a form of nonlinear hierarchical coupling between state values akin to predictive coding and artificial neural networks in general. We derive the update equations corresponding to this generalization of HGF and conceptualize them as connecting a network of (belief) nodes where parent nodes either predict the state of child nodes or their rate of change. This enables us to (1) create modular architectures with generic computational steps in each node of the network, and (2) disclose the hierarchical message passing implied by generalized HGF models and to compare this to comparable schemes under predictive coding. We find that the algorithmic architecture instantiated by the generalized HGF is largely compatible with that of predictive coding but extends it with some unique predictions which arise from precision and volatility related computations. Our developments enable highly flexible implementations of hierarchical Bayesian models for empirical data analysis and are available as open source software. △ Less

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2303.03915 [pdf, other]

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Authors: Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major, Gerard Dupont, Stella Biderman, Anna Rogers, Loubna Ben allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa , et al. (29 additional authors not shown)

Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the f… ▽ More As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus. △ Less

Submitted 7 March, 2023; originally announced March 2023.

Comments: NeurIPS 2022, Datasets and Benchmarks Track

ACM Class: I.2.7

arXiv:2211.12486 [pdf, other]

Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations

Authors: Alexander Binder, Leander Weber, Sebastian Lapuschkin, Grégoire Montavon, Klaus-Robert Müller, Wojciech Samek

Abstract: While the evaluation of explanations is an important step towards trustworthy models, it needs to be done carefully, and the employed metrics need to be well-understood. Specifically model randomization testing is often overestimated and regarded as a sole criterion for selecting or discarding certain explanation methods. To address shortcomings of this test, we start by observing an experimental… ▽ More While the evaluation of explanations is an important step towards trustworthy models, it needs to be done carefully, and the employed metrics need to be well-understood. Specifically model randomization testing is often overestimated and regarded as a sole criterion for selecting or discarding certain explanation methods. To address shortcomings of this test, we start by observing an experimental gap in the ranking of explanation methods between randomization-based sanity checks [1] and model output faithfulness measures (e.g. [25]). We identify limitations of model-randomization-based sanity checks for the purpose of evaluating explanations. Firstly, we show that uninformative attribution maps created with zero pixel-wise covariance easily achieve high scores in this type of checks. Secondly, we show that top-down model randomization preserves scales of forward pass activations with high probability. That is, channels with large activations have a high probility to contribute strongly to the output, even after randomization of the network on top of them. Hence, explanations after randomization can only be expected to differ to a certain extent. This explains the observed experimental gap. In summary, these results demonstrate the inadequacy of model-randomization-based sanity checks as a criterion to rank attribution methods. △ Less

Submitted 22 November, 2022; originally announced November 2022.

Comments: 23 pages

arXiv:2211.05100 [pdf, other]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Authors: BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major , et al. (369 additional authors not shown)

Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access… ▽ More Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License. △ Less

Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2206.15076 [pdf, other]

BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Authors: Jason Alan Fries, Leon Weber, Natasha Seelam, Gabriel Altay, Debajyoti Datta, Samuele Garda, Myungsun Kang, Ruisi Su, Wojciech Kusa, Samuel Cahyawijaya, Fabio Barth, Simon Ott, Matthias Samwald, Stephen Bach, Stella Biderman, Mario Sänger, Bo Wang, Alison Callahan, Daniel León Periñán, Théo Gigant, Patrick Haller, Jenny Chim, Jose David Posada, John Michael Giorgi, Karthik Rangasai Sivaraman , et al. (18 additional authors not shown)

Abstract: Training and evaluating language models increasingly requires the construction of meta-datasets --diverse collections of curated data with clear provenance. Natural language prompting has recently lead to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful i… ▽ More Training and evaluating language models increasingly requires the construction of meta-datasets --diverse collections of curated data with clear provenance. Natural language prompting has recently lead to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero shot language model evaluation. We discuss our process for task schema harmonization, data auditing, contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBIO is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical △ Less

Submitted 30 June, 2022; originally announced June 2022.

Comments: Submitted to NeurIPS 2022 Datasets and Benchmarks Track

arXiv:2205.01929 [pdf, other]

Explain to Not Forget: Defending Against Catastrophic Forgetting with XAI

Authors: Sami Ede, Serop Baghdadlian, Leander Weber, An Nguyen, Dario Zanca, Wojciech Samek, Sebastian Lapuschkin

Abstract: The ability to continuously process and retain new information like we do naturally as humans is a feat that is highly sought after when training neural networks. Unfortunately, the traditional optimization algorithms often require large amounts of data available during training time and updates wrt. new data are difficult after the training process has been completed. In fact, when new data or ta… ▽ More The ability to continuously process and retain new information like we do naturally as humans is a feat that is highly sought after when training neural networks. Unfortunately, the traditional optimization algorithms often require large amounts of data available during training time and updates wrt. new data are difficult after the training process has been completed. In fact, when new data or tasks arise, previous progress may be lost as neural networks are prone to catastrophic forgetting. Catastrophic forgetting describes the phenomenon when a neural network completely forgets previous knowledge when given new information. We propose a novel training algorithm called training by explaining in which we leverage Layer-wise Relevance Propagation in order to retain the information a neural network has already learned in previous tasks when training on new data. The method is evaluated on a range of benchmark datasets as well as more complex data. Our method not only successfully retains the knowledge of old tasks within the neural networks but does so more resource-efficiently than other state-of-the-art solutions. △ Less

Submitted 22 June, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

Comments: 14 pages including appendix, 5 figures, 2 tables, 1 algorithm listing. v2 update increases figure readability, updates Fig 5 caption, adds our collaborators Dario and An as co-authors v3 brings the preprint in line with the final version accepted for peer-reviewed publication at CD-MAKE 2022. v4 metadata update

arXiv:2203.08008 [pdf, other]

Beyond Explaining: Opportunities and Challenges of XAI-Based Model Improvement

Authors: Leander Weber, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek

Abstract: Explainable Artificial Intelligence (XAI) is an emerging research field bringing transparency to highly complex and opaque machine learning (ML) models. Despite the development of a multitude of methods to explain the decisions of black-box classifiers in recent years, these tools are seldomly used beyond visualization purposes. Only recently, researchers have started to employ explanations in pra… ▽ More Explainable Artificial Intelligence (XAI) is an emerging research field bringing transparency to highly complex and opaque machine learning (ML) models. Despite the development of a multitude of methods to explain the decisions of black-box classifiers in recent years, these tools are seldomly used beyond visualization purposes. Only recently, researchers have started to employ explanations in practice to actually improve models. This paper offers a comprehensive overview over techniques that apply XAI practically for improving various properties of ML models, and systematically categorizes these approaches, comparing their respective strengths and weaknesses. We provide a theoretical perspective on these methods, and show empirically through experiments on toy and realistic settings how explanations can help improve properties such as model generalization ability or reasoning, among others. We further discuss potential caveats and drawbacks of these methods. We conclude that while model improvement based on XAI can have significant beneficial effects even on complex and not easily quantifyable model properties, these methods need to be applied carefully, since their success can vary depending on a multitude of factors, such as the model and dataset used, or the employed explanation method. △ Less

Submitted 15 March, 2022; originally announced March 2022.

arXiv:2202.06861 [pdf, other]

Quantus: An Explainable AI Toolkit for Responsible Evaluation of Neural Network Explanations and Beyond

Authors: Anna Hedström, Leander Weber, Dilyara Bareeva, Daniel Krakowczyk, Franz Motzkus, Wojciech Samek, Sebastian Lapuschkin, Marina M. -C. Höhne

Abstract: The evaluation of explanation methods is a research topic that has not yet been explored deeply, however, since explainability is supposed to strengthen trust in artificial intelligence, it is necessary to systematically review and compare explanation methods in order to confirm their correctness. Until now, no tool with focus on XAI evaluation exists that exhaustively and speedily allows research… ▽ More The evaluation of explanation methods is a research topic that has not yet been explored deeply, however, since explainability is supposed to strengthen trust in artificial intelligence, it is necessary to systematically review and compare explanation methods in order to confirm their correctness. Until now, no tool with focus on XAI evaluation exists that exhaustively and speedily allows researchers to evaluate the performance of explanations of neural network predictions. To increase transparency and reproducibility in the field, we therefore built Quantus -- a comprehensive, evaluation toolkit in Python that includes a growing, well-organised collection of evaluation metrics and tutorials for evaluating explainable methods. The toolkit has been thoroughly tested and is available under an open-source license on PyPi (or on https://github.com/understandable-machine-intelligence-lab/Quantus/). △ Less

Submitted 27 April, 2023; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: 4 pages, 1 figure, 1 table

Journal ref: Journal of Machine Learning Research, Vol. 24 (2023) 1-11

arXiv:2202.06621 [pdf, other]

Measurably Stronger Explanation Reliability via Model Canonization

Authors: Franz Motzkus, Leander Weber, Sebastian Lapuschkin

Abstract: While rule-based attribution methods have proven useful for providing local explanations for Deep Neural Networks, explaining modern and more varied network architectures yields new challenges in generating trustworthy explanations, since the established rule sets might not be sufficient or applicable to novel network structures. As an elegant solution to the above issue, network canonization has… ▽ More While rule-based attribution methods have proven useful for providing local explanations for Deep Neural Networks, explaining modern and more varied network architectures yields new challenges in generating trustworthy explanations, since the established rule sets might not be sufficient or applicable to novel network structures. As an elegant solution to the above issue, network canonization has recently been introduced. This procedure leverages the implementation-dependency of rule-based attributions and restructures a model into a functionally identical equivalent of alternative design to which established attribution rules can be applied. However, the idea of canonization and its usefulness have so far only been explored qualitatively. In this work, we quantitatively verify the beneficial effects of network canonization to rule-based attributions on VGG-16 and ResNet18 models with BatchNorm layers and thus extend the current best practices for obtaining reliable neural network explanations. △ Less

Submitted 14 February, 2022; originally announced February 2022.

Comments: 5 pages, 4 figures

arXiv:2202.03482 [pdf, other]

Navigating Neural Space: Revisiting Concept Activation Vectors to Overcome Directional Divergence

Authors: Frederik Pahde, Maximilian Dreyer, Leander Weber, Moritz Weckbecker, Christopher J. Anders, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

Abstract: With a growing interest in understanding neural network prediction strategies, Concept Activation Vectors (CAVs) have emerged as a popular tool for modeling human-understandable concepts in the latent space. Commonly, CAVs are computed by leveraging linear classifiers optimizing the separability of latent representations of samples with and without a given concept. However, in this paper we show t… ▽ More With a growing interest in understanding neural network prediction strategies, Concept Activation Vectors (CAVs) have emerged as a popular tool for modeling human-understandable concepts in the latent space. Commonly, CAVs are computed by leveraging linear classifiers optimizing the separability of latent representations of samples with and without a given concept. However, in this paper we show that such a separability-oriented computation leads to solutions, which may diverge from the actual goal of precisely modeling the concept direction. This discrepancy can be attributed to the significant influence of distractor directions, i.e., signals unrelated to the concept, which are picked up by filters (i.e., weights) of linear models to optimize class-separability. To address this, we introduce pattern-based CAVs, solely focussing on concept signals, thereby providing more accurate concept directions. We evaluate various CAV methods in terms of their alignment with the true concept direction and their impact on CAV applications, including concept sensitivity testing and model correction for shortcut behavior caused by data artifacts. We demonstrate the benefits of pattern-based CAVs using the Pediatric Bone Age, ISIC2019, and FunnyBirds datasets with VGG, ResNet, and EfficientNet model architectures. △ Less

Submitted 5 February, 2024; v1 submitted 7 February, 2022; originally announced February 2022.

arXiv:2101.11287 [pdf, other]

Language Modelling as a Multi-Task Problem

Authors: Lucas Weber, Jaap Jumelet, Elia Bruni, Dieuwke Hupkes

Abstract: In this paper, we propose to study language modelling as a multi-task problem, bringing together three strands of research: multi-task learning, linguistics, and interpretability. Based on hypotheses derived from linguistic theory, we investigate whether language models adhere to learning principles of multi-task learning during training. To showcase the idea, we analyse the generalisation behavio… ▽ More In this paper, we propose to study language modelling as a multi-task problem, bringing together three strands of research: multi-task learning, linguistics, and interpretability. Based on hypotheses derived from linguistic theory, we investigate whether language models adhere to learning principles of multi-task learning during training. To showcase the idea, we analyse the generalisation behaviour of language models as they learn the linguistic concept of Negative Polarity Items (NPIs). Our experiments demonstrate that a multi-task setting naturally emerges within the objective of the more general task of language modelling.We argue that this insight is valuable for multi-task learning, linguistics and interpretability research and can lead to exciting new findings in all three domains. △ Less

Submitted 27 January, 2021; originally announced January 2021.

Comments: Accepted for publication at EACL 2021

arXiv:2011.00034 [pdf, other]

Adaptive Semi-Supervised Intent Inferral to Control a Powered Hand Orthosis for Stroke

Authors: Jingxi Xu, Cassie Meeker, Ava Chen, Lauren Winterbottom, Michaela Fraser, Sangwoo Park, Lynne M. Weber, Mitchell Miya, Dawn Nilsen, Joel Stein, Matei Ciocarlie

Abstract: In order to provide therapy in a functional context, controls for wearable robotic orthoses need to be robust and intuitive. We have previously introduced an intuitive, user-driven, EMG-based method to operate a robotic hand orthosis, but the process of training a control that is robust to concept drift (changes in the input signal) places a substantial burden on the user. In this paper, we explor… ▽ More In order to provide therapy in a functional context, controls for wearable robotic orthoses need to be robust and intuitive. We have previously introduced an intuitive, user-driven, EMG-based method to operate a robotic hand orthosis, but the process of training a control that is robust to concept drift (changes in the input signal) places a substantial burden on the user. In this paper, we explore semi-supervised learning as a paradigm for controlling a powered hand orthosis for stroke subjects. To the best of our knowledge, this is the first use of semi-supervised learning for an orthotic application. Specifically, we propose a disagreement-based semi-supervision algorithm for handling intrasession concept drift based on multimodal ipsilateral sensing. We evaluate the performance of our algorithm on data collected from five stroke subjects. Our results show that the proposed algorithm helps the device adapt to intrasession drift using unlabeled data and reduces the training burden placed on the user. We also validate the feasibility of our proposed algorithm with a functional task; in these experiments, two subjects successfully completed multiple instances of a pick-and-handover task. △ Less

Submitted 1 March, 2022; v1 submitted 30 October, 2020; originally announced November 2020.

Comments: 7 pages; Accepted to International Conference on Robotics and Automation (ICRA) 2022

arXiv:2008.07347 [pdf, other]

HunFlair: An Easy-to-Use Tool for State-of-the-Art Biomedical Named Entity Recognition

Authors: Leon Weber, Mario Sänger, Jannes Münchmeyer, Maryam Habibi, Ulf Leser, Alan Akbik

Abstract: Summary: Named Entity Recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, highly accurate, and robust towards variations in text genre and style. To this end, we propose HunFlair, an NER tagger covering multiple entity types integrated into the widely used NLP framework Flair. HunFlair outperforms… ▽ More Summary: Named Entity Recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, highly accurate, and robust towards variations in text genre and style. To this end, we propose HunFlair, an NER tagger covering multiple entity types integrated into the widely used NLP framework Flair. HunFlair outperforms other state-of-the-art standalone NER tools with an average gain of 7.26 pp over the next best tool, can be installed with a single command and is applied with only four lines of code. Availability: HunFlair is freely available through the Flair framework under an MIT license: https://github.com/flairNLP/flair and is compatible with all major operating systems. Contact:{weberple,saengema,alan.akbik}@informatik.hu-berlin.de △ Less

Submitted 18 August, 2020; v1 submitted 17 August, 2020; originally announced August 2020.

Comments: - Corrected author list - Updated project link

arXiv:2004.10484 [pdf, other]

doi 10.1109/ICPR48806.2021.9413242

Understanding Integrated Gradients with SmoothTaylor for Deep Neural Network Attribution

Authors: Gary S. W. Goh, Sebastian Lapuschkin, Leander Weber, Wojciech Samek, Alexander Binder

Abstract: Integrated Gradients as an attribution method for deep neural network models offers simple implementability. However, it suffers from noisiness of explanations which affects the ease of interpretability. The SmoothGrad technique is proposed to solve the noisiness issue and smoothen the attribution maps of any gradient-based attribution method. In this paper, we present SmoothTaylor as a novel theo… ▽ More Integrated Gradients as an attribution method for deep neural network models offers simple implementability. However, it suffers from noisiness of explanations which affects the ease of interpretability. The SmoothGrad technique is proposed to solve the noisiness issue and smoothen the attribution maps of any gradient-based attribution method. In this paper, we present SmoothTaylor as a novel theoretical concept bridging Integrated Gradients and SmoothGrad, from the Taylor's theorem perspective. We apply the methods to the image classification problem, using the ILSVRC2012 ImageNet object recognition dataset, and a couple of pretrained image models to generate attribution maps. These attribution maps are empirically evaluated using quantitative measures for sensitivity and noise level. We further propose adaptive noising to optimize for the noise scale hyperparameter value. From our experiments, we find that the SmoothTaylor approach together with adaptive noising is able to generate better quality saliency maps with lesser noise and higher sensitivity to the relevant points in the input space as compared to Integrated Gradients. △ Less

Submitted 2 September, 2021; v1 submitted 22 April, 2020; originally announced April 2020.

Comments: 8 pages, 3 figures. Accepted in 25th International Conference on Pattern Recognition, (ICPR) 2020. In Proceedings: pp. 4949-4956

arXiv:2002.06161 [pdf]

doi 10.1186/s12859-020-03928-1

menoci: Lightweight Extensible Web Portal enabling FAIR Data Management for Biomedical Research Projects

Authors: Markus Suhr, Christoph Lehmann, Christian Robert Bauer, Theresa Bender, Cornelius Knopp, Luca Freckmann, Björn Öst Hansen, Christian Henke, Georg Aschenbrandt, Lea Kühlborn, Sophia Rheinländer, Linus Weber, Bartlomiej Marzec, Marcel Hellkamp, Philipp Wieder, Harald Kusch, Ulrich Sax, Sara Yasemin Nussbeck

Abstract: Background: Biomedical research projects deal with data management requirements from multiple sources like funding agencies' guidelines, publisher policies, discipline best practices, and their own users' needs. We describe functional and quality requirements based on many years of experience implementing data management for the CRC 1002 and CRC 1190. A fully equipped data management software shou… ▽ More Background: Biomedical research projects deal with data management requirements from multiple sources like funding agencies' guidelines, publisher policies, discipline best practices, and their own users' needs. We describe functional and quality requirements based on many years of experience implementing data management for the CRC 1002 and CRC 1190. A fully equipped data management software should improve documentation of experiments and materials, enable data storage and sharing according to the FAIR Guiding Principles while maximizing usability, information security, as well as software sustainability and reusability. Results: We introduce the modular web portal software menoci for data collection, experiment documentation, data publication, sharing, and preservation in biomedical research projects. Menoci modules are based on the Drupal content management system which enables lightweight deployment and setup, and creates the possibility to combine research data management with a customisable project home page or collaboration platform. Conclusions: Management of research data and digital research artefacts is transforming from individual researcher or groups best practices towards project- or organisation-wide service infrastructures. To enable and support this structural transformation process, a vital ecosystem of open source software tools is needed. Menoci is a contribution to this ecosystem of research data management tools that is specifically designed to support biomedical research projects. △ Less

Submitted 7 February, 2020; originally announced February 2020.

Comments: Preprint. 19 pages, 2 figures

Journal ref: BMC Bioinformatics 21, 582 (2020)

arXiv:1912.11425 [pdf, other]

Finding and Removing Clever Hans: Using Explanation Methods to Debug and Improve Deep Models

Authors: Christopher J. Anders, Leander Weber, David Neumann, Wojciech Samek, Klaus-Robert Müller, Sebastian Lapuschkin

Abstract: Contemporary learning models for computer vision are typically trained on very large (benchmark) datasets with millions of samples. These may, however, contain biases, artifacts, or errors that have gone unnoticed and are exploitable by the model. In the worst case, the trained model does not learn a valid and generalizable strategy to solve the problem it was trained for, and becomes a 'Clever-Ha… ▽ More Contemporary learning models for computer vision are typically trained on very large (benchmark) datasets with millions of samples. These may, however, contain biases, artifacts, or errors that have gone unnoticed and are exploitable by the model. In the worst case, the trained model does not learn a valid and generalizable strategy to solve the problem it was trained for, and becomes a 'Clever-Hans' (CH) predictor that bases its decisions on spurious correlations in the training data, potentially yielding an unrepresentative or unfair, and possibly even hazardous predictor. In this paper, we contribute by providing a comprehensive analysis framework based on a scalable statistical analysis of attributions from explanation methods for large data corpora. Based on a recent technique - Spectral Relevance Analysis - we propose the following technical contributions and resulting findings: (a) a scalable quantification of artifactual and poisoned classes where the machine learning models under study exhibit CH behavior, (b) several approaches denoted as Class Artifact Compensation (ClArC), which are able to effectively and significantly reduce a model's CH behavior. I.e., we are able to un-Hans models trained on (poisoned) datasets, such as the popular ImageNet data corpus. We demonstrate that ClArC, defined in a simple theoretical framework, may be implemented as part of a Neural Network's training or fine-tuning process, or in a post-hoc manner by injecting additional layers, preventing any further propagation of undesired CH features, into the network architecture. Using our proposed methods, we provide qualitative and quantitative analyses of the biases and artifacts in various datasets. We demonstrate that these insights can give rise to improved, more representative and fairer models operating on implicitly cleaned data corpora. △ Less

Submitted 18 December, 2020; v1 submitted 22 December, 2019; originally announced December 2019.

Comments: 47 pages, 21 figures

arXiv:1911.08003 [pdf, other]

doi 10.1109/TNSRE.2020.3021691

User-Driven Functional Movement Training with a Wearable Hand Robot after Stroke

Authors: Sangwoo Park, Michaela Fraser, Lynne M. Weber, Cassie Meeker, Lauri Bishop, Daniel Geller, Joel Stein, Matei Ciocarlie

Abstract: We studied the performance of a robotic orthosis designed to assist the paretic hand after stroke. It is wearable and fully user-controlled, serving two possible roles: as a therapeutic tool that facilitates device mediated hand exercises to recover neuromuscular function or as an assistive device for use in everyday activities to aid functional use of the hand. We present the clinical outcomes of… ▽ More We studied the performance of a robotic orthosis designed to assist the paretic hand after stroke. It is wearable and fully user-controlled, serving two possible roles: as a therapeutic tool that facilitates device mediated hand exercises to recover neuromuscular function or as an assistive device for use in everyday activities to aid functional use of the hand. We present the clinical outcomes of a pilot study designed as a feasibility test for these hypotheses. 11 chronic stroke (> 2 years) patients with moderate muscle tone (Modified Ashworth Scale less than or equal to 2 in upper extremity) engaged in a month-long training protocol using the orthosis. Individuals were evaluated using standardized outcome measures, both with and without orthosis assistance. Fugl-Meyer post intervention scores without robotic assistance showed improvement focused specifically at the distal joints of the upper limb, suggesting the use of the orthosis as a rehabilitative device for the hand. Action Research Arm Test scores post intervention with robotic assistance showed that the device may serve an assistive role in grasping tasks. These results highlight the potential for wearable and user-driven robotic hand orthoses to extend the use and training of the affected upper limb after stroke. △ Less

Submitted 2 September, 2020; v1 submitted 18 November, 2019; originally announced November 2019.

Comments: 10 pages, 15 figures, Camera ready version

arXiv:1906.06187 [pdf, other]

NLProlog: Reasoning with Weak Unification for Question Answering in Natural Language

Authors: Leon Weber, Pasquale Minervini, Jannes Münchmeyer, Ulf Leser, Tim Rocktäschel

Abstract: Rule-based models are attractive for various tasks because they inherently lead to interpretable and explainable decisions and can easily incorporate prior knowledge. However, such systems are difficult to apply to problems involving natural language, due to its linguistic variability. In contrast, neural models can cope very well with ambiguity by learning distributed representations of words and… ▽ More Rule-based models are attractive for various tasks because they inherently lead to interpretable and explainable decisions and can easily incorporate prior knowledge. However, such systems are difficult to apply to problems involving natural language, due to its linguistic variability. In contrast, neural models can cope very well with ambiguity by learning distributed representations of words and their composition from data, but lead to models that are difficult to interpret. In this paper, we describe a model combining neural networks with logic programming in a novel manner for solving multi-hop reasoning tasks over natural language. Specifically, we propose to use a Prolog prover which we extend to utilize a similarity function over pretrained sentence encoders. We fine-tune the representations for the similarity function via backpropagation. This leads to a system that can apply rule-based reasoning to natural language, and induce domain-specific rules from training data. We evaluate the proposed system on two different question answering tasks, showing that it outperforms two baselines -- BIDAF (Seo et al., 2016a) and FAST QA (Weissenborn et al., 2017b) on a subset of the WikiHop corpus and achieves competitive results on the MedHop data set (Welbl et al., 2017). △ Less

Submitted 14 June, 2019; originally announced June 2019.

Comments: ACL 2019

arXiv:1904.08368 [pdf, other]

Relay: A High-Level Compiler for Deep Learning

Authors: Jared Roesch, Steven Lyubomirsky, Marisa Kirisame, Logan Weber, Josh Pollock, Luis Vega, Ziheng Jiang, Tianqi Chen, Thierry Moreau, Zachary Tatlock

Abstract: Frameworks for writing, compiling, and optimizing deep learning (DL) models have recently enabled progress in areas like computer vision and natural language processing. Extending these frameworks to accommodate the rapidly diversifying landscape of DL models and hardware platforms presents challenging tradeoffs between expressivity, composability, and portability. We present Relay, a new compiler… ▽ More Frameworks for writing, compiling, and optimizing deep learning (DL) models have recently enabled progress in areas like computer vision and natural language processing. Extending these frameworks to accommodate the rapidly diversifying landscape of DL models and hardware platforms presents challenging tradeoffs between expressivity, composability, and portability. We present Relay, a new compiler framework for DL. Relay's functional, statically typed intermediate representation (IR) unifies and generalizes existing DL IRs to express state-of-the-art models. The introduction of Relay's expressive IR requires careful design of domain-specific optimizations, addressed via Relay's extension mechanisms. Using these extension mechanisms, Relay supports a unified compiler that can target a variety of hardware platforms. Our evaluation demonstrates Relay's competitive performance for a broad class of models and devices (CPUs, GPUs, and emerging accelerators). Relay's design demonstrates how a unified IR can provide expressivity, composability, and portability without compromising performance. △ Less

Submitted 24 August, 2019; v1 submitted 17 April, 2019; originally announced April 2019.

arXiv:1810.00952 [pdf, other]

doi 10.1145/3211346.3211348

Relay: A New IR for Machine Learning Frameworks

Authors: Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, Zachary Tatlock

Abstract: Machine learning powers diverse services in industry including search, translation, recommendation systems, and security. The scale and importance of these models require that they be efficient, expressive, and portable across an array of heterogeneous hardware devices. These constraints are often at odds; in order to better accommodate them we propose a new high-level intermediate representation… ▽ More Machine learning powers diverse services in industry including search, translation, recommendation systems, and security. The scale and importance of these models require that they be efficient, expressive, and portable across an array of heterogeneous hardware devices. These constraints are often at odds; in order to better accommodate them we propose a new high-level intermediate representation (IR) called Relay. Relay is being designed as a purely-functional, statically-typed language with the goal of balancing efficient compilation, expressiveness, and portability. We discuss the goals of Relay and highlight its important design constraints. Our prototype is part of the open source NNVM compiler framework, which powers Amazon's deep learning framework MxNet. △ Less

Submitted 25 September, 2018; originally announced October 2018.

arXiv:1808.00092 [pdf, other]

doi 10.1109/LRA.2018.2890199

Multimodal Sensing and Interaction for a Robotic Hand Orthosis

Authors: Sangwoo Park, Cassie Meeker, Lynne M. Weber, Lauri Bishop, Joel Stein, Matei Ciocarlie

Abstract: Wearable robotic hand rehabilitation devices can allow greater freedom and flexibility than their workstation-like counterparts. However, the field is generally lacking effective methods by which the user can operate the device: such controls must be effective, intuitive, and robust to the wide range of possible impairment patterns. Even when focusing on a specific condition, such as stroke, the v… ▽ More Wearable robotic hand rehabilitation devices can allow greater freedom and flexibility than their workstation-like counterparts. However, the field is generally lacking effective methods by which the user can operate the device: such controls must be effective, intuitive, and robust to the wide range of possible impairment patterns. Even when focusing on a specific condition, such as stroke, the variety of encountered upper limb impairment patterns means that a single sensing modality, such as electromyography (EMG), might not be sufficient to enable controls for a broad range of users. To address this significant gap, we introduce a multimodal sensing and interaction paradigm for an active hand orthosis. In our proof-of-concept implementation, EMG is complemented by other sensing modalities, such as finger bend and contact pressure sensors. We propose multimodal interaction methods that utilize this sensory data as input, and show they can enable tasks for stroke survivors who exhibit different impairment patterns. We believe that robotic hand orthoses developed as multimodal sensory platforms with help address some of the key challenges in physical interaction with the user. △ Less

Submitted 18 December, 2018; v1 submitted 31 July, 2018; originally announced August 2018.

Comments: 8 pages. 9 Figures. IEEE Robotics and Automation Letters. Preprint version. Accepted Dec, 2018

Journal ref: IEEE Robotics and Automation Letters, Vol. 4, No. 2, pp. 315 - 322, April 2019

arXiv:1804.02462 [pdf, other]

Human Robot Interface for Assistive Grasping

Authors: David Watkins, Chaiwen Chou, Caroline Weinberg, Jacob Varley, Kenneth Lyons, Sanjay Joshi, Lynne Weber, Joel Stein, Peter Allen

Abstract: This work describes a new human-in-the-loop (HitL) assistive grasping system for individuals with varying levels of physical capabilities. We investigated the feasibility of using four potential input devices with our assistive grasping system interface, using able-bodied individuals to define a set of quantitative metrics that could be used to assess an assistive grasping system. We then took the… ▽ More This work describes a new human-in-the-loop (HitL) assistive grasping system for individuals with varying levels of physical capabilities. We investigated the feasibility of using four potential input devices with our assistive grasping system interface, using able-bodied individuals to define a set of quantitative metrics that could be used to assess an assistive grasping system. We then took these measurements and created a generalized benchmark for evaluating the effectiveness of any arbitrary input device into a HitL grasping system. The four input devices were a mouse, a speech recognition device, an assistive switch, and a novel sEMG device developed by our group that was connected either to the forearm or behind the ear of the subject. These preliminary results provide insight into how different interface devices perform for generalized assistive grasping tasks and also highlight the potential of sEMG based control for severely disabled individuals. △ Less

Submitted 6 April, 2018; originally announced April 2018.

Comments: 8 pages, 21 figures

arXiv:1802.06131 [pdf, other]

Design and Development of Effective Transmission Mechanisms on a Tendon Driven Hand Orthosis for Stroke Patients

Authors: Sangwoo Park, Lynne M. Weber, Lauri Bishop, Joel Stein, Matei Ciocarlie

Abstract: Tendon-driven hand orthoses have advantages over exoskeletons with respect to wearability and safety because of their low-profile design and ability to fit a range of patients without requiring custom joint alignment. However, no existing study on a wearable tendon-driven hand orthosis for stroke patients presents evidence that such devices can overcome spasticity given repeated use and fatigue, o… ▽ More Tendon-driven hand orthoses have advantages over exoskeletons with respect to wearability and safety because of their low-profile design and ability to fit a range of patients without requiring custom joint alignment. However, no existing study on a wearable tendon-driven hand orthosis for stroke patients presents evidence that such devices can overcome spasticity given repeated use and fatigue, or discusses transmission efficiency. In this study, we propose two designs that provide effective force transmission by increasing moment arms around finger joints. We evaluate the designs with geometric models and experiment using a 3D-printed artificial finger to find force and joint angle characteristics of the suggested structures. We also perform clinical tests with stroke patients to demonstrate the feasibility of the designs. The testing supports the hypothesis that the proposed designs efficiently elicit extension of the digits in patients with spasticity as compared to existing baselines. △ Less

Submitted 31 July, 2018; v1 submitted 16 February, 2018; originally announced February 2018.

Comments: 7 pages, ICRA 2018

Showing 1–36 of 36 results for author: Weber, L