-
Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation
Authors:
Ryutaro Tanno,
David G. T. Barrett,
Andrew Sellergren,
Sumedh Ghaisas,
Sumanth Dathathri,
Abigail See,
Johannes Welbl,
Karan Singhal,
Shekoofeh Azizi,
Tao Tu,
Mike Schaekermann,
Rhys May,
Roy Lee,
SiWai Man,
Zahra Ahmed,
Sara Mahdavi,
Yossi Matias,
Joelle Barral,
Ali Eslami,
Danielle Belgrave,
Vivek Natarajan,
Shravya Shetty,
Pushmeet Kohli,
Po-Sen Huang,
Alan Karthikesalingam
, et al. (1 additional authors not shown)
Abstract:
Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offer clear pote…
▽ More
Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offer clear potential in ameliorating the situation, the path to real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, $\textit{Flamingo-CXR}$, by fine-tuning a well-known vision-language foundation model on radiology data. To evaluate the quality of the AI-generated reports, a group of 16 certified radiologists provide detailed evaluations of AI-generated and human written reports for chest X-rays from an intensive care setting in the United States and an inpatient setting in India. At least one radiologist (out of two per case) preferred the AI report to the ground truth report in over 60$\%$ of cases for both datasets. Amongst the subset of AI-generated reports that contain errors, the most frequently cited reasons were related to the location and finding, whereas for human written reports, most mistakes were related to severity and finding. This disparity suggested potential complementarity between our AI system and human experts, prompting us to develop an assistive scenario in which Flamingo-CXR generates a first-draft report, which is subsequently revised by a clinician. This is the first demonstration of clinician-AI collaboration for report writing, and the resultant reports are assessed to be equivalent or preferred by at least one radiologist to reports written by experts alone in 80$\%$ of in-patient cases and 60$\%$ of intensive care cases.
△ Less
Submitted 20 December, 2023; v1 submitted 30 November, 2023;
originally announced November 2023.
-
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
Authors:
Maribeth Rauh,
John Mellor,
Jonathan Uesato,
Po-Sen Huang,
Johannes Welbl,
Laura Weidinger,
Sumanth Dathathri,
Amelia Glaese,
Geoffrey Irving,
Iason Gabriel,
William Isaac,
Lisa Anne Hendricks
Abstract:
Large language models produce human-like text that drive a growing number of applications. However, recent literature and, increasingly, real world observations, have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful. Though work to evaluate language model harms is under way, translating foresight about which harms may arise into rigorous b…
▽ More
Large language models produce human-like text that drive a growing number of applications. However, recent literature and, increasingly, real world observations, have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful. Though work to evaluate language model harms is under way, translating foresight about which harms may arise into rigorous benchmarks is not straightforward. To facilitate this translation, we outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks. We then use these characteristics as a lens to identify trends and gaps in existing benchmarks. Finally, we apply them in a case study of the Perspective API, a toxicity classifier that is widely used in harm benchmarks. Our characteristics provide one piece of the bridge that translates between foresight and effective evaluation.
△ Less
Submitted 28 October, 2022; v1 submitted 16 June, 2022;
originally announced June 2022.
-
Training Compute-Optimal Large Language Models
Authors:
Jordan Hoffmann,
Sebastian Borgeaud,
Arthur Mensch,
Elena Buchatskaya,
Trevor Cai,
Eliza Rutherford,
Diego de Las Casas,
Lisa Anne Hendricks,
Johannes Welbl,
Aidan Clark,
Tom Hennigan,
Eric Noland,
Katie Millican,
George van den Driessche,
Bogdan Damoc,
Aurelia Guy,
Simon Osindero,
Karen Simonyan,
Erich Elsen,
Jack W. Rae,
Oriol Vinyals,
Laurent Sifre
Abstract:
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion…
▽ More
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
△ Less
Submitted 29 March, 2022;
originally announced March 2022.
-
Competition-Level Code Generation with AlphaCode
Authors:
Yujia Li,
David Choi,
Junyoung Chung,
Nate Kushman,
Julian Schrittwieser,
Rémi Leblond,
Tom Eccles,
James Keeling,
Felix Gimeno,
Agustin Dal Lago,
Thomas Hubert,
Peter Choy,
Cyprien de Masson d'Autume,
Igor Babuschkin,
Xinyun Chen,
Po-Sen Huang,
Johannes Welbl,
Sven Gowal,
Alexey Cherepanov,
James Molloy,
Daniel J. Mankowitz,
Esme Sutherland Robson,
Pushmeet Kohli,
Nando de Freitas,
Koray Kavukcuoglu
, et al. (1 additional authors not shown)
Abstract:
Programming is a powerful and ubiquitous problem-solving tool. Developing systems that can assist programmers or even generate programs independently could make programming more productive and accessible, yet so far incorporating innovations in AI has proven challenging. Recent large-scale language models have demonstrated an impressive ability to generate code, and are now able to complete simple…
▽ More
Programming is a powerful and ubiquitous problem-solving tool. Developing systems that can assist programmers or even generate programs independently could make programming more productive and accessible, yet so far incorporating innovations in AI has proven challenging. Recent large-scale language models have demonstrated an impressive ability to generate code, and are now able to complete simple programming tasks. However, these models still perform poorly when evaluated on more complex, unseen problems that require problem-solving skills beyond simply translating instructions into code. For example, competitive programming problems which require an understanding of algorithms and complex natural language remain extremely challenging. To address this gap, we introduce AlphaCode, a system for code generation that can create novel solutions to these problems that require deeper reasoning. In simulated evaluations on recent programming competitions on the Codeforces platform, AlphaCode achieved on average a ranking of top 54.3% in competitions with more than 5,000 participants. We found that three key components were critical to achieve good and reliable performance: (1) an extensive and clean competitive programming dataset for training and evaluation, (2) large and efficient-to-sample transformer-based architectures, and (3) large-scale model sampling to explore the search space, followed by filtering based on program behavior to a small set of submissions.
△ Less
Submitted 8 February, 2022;
originally announced March 2022.
-
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Authors:
Jack W. Rae,
Sebastian Borgeaud,
Trevor Cai,
Katie Millican,
Jordan Hoffmann,
Francis Song,
John Aslanides,
Sarah Henderson,
Roman Ring,
Susannah Young,
Eliza Rutherford,
Tom Hennigan,
Jacob Menick,
Albin Cassirer,
Richard Powell,
George van den Driessche,
Lisa Anne Hendricks,
Maribeth Rauh,
Po-Sen Huang,
Amelia Glaese,
Johannes Welbl,
Sumanth Dathathri,
Saffron Huang,
Jonathan Uesato,
John Mellor
, et al. (55 additional authors not shown)
Abstract:
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gop…
▽ More
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.
△ Less
Submitted 21 January, 2022; v1 submitted 8 December, 2021;
originally announced December 2021.
-
Challenges in Detoxifying Language Models
Authors:
Johannes Welbl,
Amelia Glaese,
Jonathan Uesato,
Sumanth Dathathri,
John Mellor,
Lisa Anne Hendricks,
Kirsty Anderson,
Pushmeet Kohli,
Ben Coppin,
Po-Sen Huang
Abstract:
Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies wit…
▽ More
Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and LM quality. We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions -- highlighting further the nuances involved in careful evaluation of LM toxicity.
△ Less
Submitted 15 September, 2021;
originally announced September 2021.
-
Evaluating the Apperception Engine
Authors:
Richard Evans,
Jose Hernandez-Orallo,
Johannes Welbl,
Pushmeet Kohli,
Marek Sergot
Abstract:
The Apperception Engine is an unsupervised learning system. Given a sequence of sensory inputs, it constructs a symbolic causal theory that both explains the sensory sequence and also satisfies a set of unity conditions. The unity conditions insist that the constituents of the theory - objects, properties, and laws - must be integrated into a coherent whole. Once a theory has been constructed, it…
▽ More
The Apperception Engine is an unsupervised learning system. Given a sequence of sensory inputs, it constructs a symbolic causal theory that both explains the sensory sequence and also satisfies a set of unity conditions. The unity conditions insist that the constituents of the theory - objects, properties, and laws - must be integrated into a coherent whole. Once a theory has been constructed, it can be applied to predict future sensor readings, retrodict earlier readings, or impute missing readings.
In this paper, we evaluate the Apperception Engine in a diverse variety of domains, including cellular automata, rhythms and simple nursery tunes, multi-modal binding problems, occlusion tasks, and sequence induction intelligence tests. In each domain, we test our engine's ability to predict future sensor values, retrodict earlier sensor values, and impute missing sensory data. The engine performs well in all these domains, significantly outperforming neural net baselines and state of the art inductive logic programming systems. These results are significant because neural nets typically struggle to solve the binding problem (where information from different modalities must somehow be combined together into different aspects of one unified object) and fail to solve occlusion tasks (in which objects are sometimes visible and sometimes obscured from view). We note in particular that in the sequence induction intelligence tests, our system achieved human-level performance. This is notable because our system is not a bespoke system designed specifically to solve intelligence tests, but a general-purpose system that was designed to make sense of any sensory sequence.
△ Less
Submitted 9 July, 2020;
originally announced July 2020.
-
Undersensitivity in Neural Reading Comprehension
Authors:
Johannes Welbl,
Pasquale Minervini,
Max Bartolo,
Pontus Stenetorp,
Sebastian Riedel
Abstract:
Current reading comprehension models generalise well to in-distribution test sets, yet perform poorly on adversarially selected inputs. Most prior work on adversarial inputs studies oversensitivity: semantically invariant text perturbations that cause a model's prediction to change when it should not. In this work we focus on the complementary problem: excessive prediction undersensitivity, where…
▽ More
Current reading comprehension models generalise well to in-distribution test sets, yet perform poorly on adversarially selected inputs. Most prior work on adversarial inputs studies oversensitivity: semantically invariant text perturbations that cause a model's prediction to change when it should not. In this work we focus on the complementary problem: excessive prediction undersensitivity, where input text is meaningfully changed but the model's prediction does not, even though it should. We formulate a noisy adversarial attack which searches among semantic variations of the question for which a model erroneously predicts the same answer, and with even higher probability. Despite comprising unanswerable questions, both SQuAD2.0 and NewsQA models are vulnerable to this attack. This indicates that although accurate, models tend to rely on spurious patterns and do not fully consider the information specified in a question. We experiment with data augmentation and adversarial training as defences, and find that both substantially decrease vulnerability to attacks on held out data, as well as held out attack spaces. Addressing undersensitivity also improves results on AddSent and AddOneSent, and models furthermore generalise better when facing train/evaluation distribution mismatch: they are less prone to overly rely on predictive cues present only in the training set, and outperform a conventional model by as much as 10.9% F1.
△ Less
Submitted 15 February, 2020;
originally announced March 2020.
-
Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension
Authors:
Max Bartolo,
Alastair Roberts,
Johannes Welbl,
Sebastian Riedel,
Pontus Stenetorp
Abstract:
Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, col…
▽ More
Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalisation to data collected without a model. We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets, yet with progressive performance deterioration with increasingly stronger models-in-the-loop. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop. When trained on data collected with a BiDAF model in the loop, RoBERTa achieves 39.9F1 on questions that it cannot answer when trained on SQuAD - only marginally lower than when trained on data collected using RoBERTa itself (41.0F1).
△ Less
Submitted 22 September, 2020; v1 submitted 1 February, 2020;
originally announced February 2020.
-
Reducing Sentiment Bias in Language Models via Counterfactual Evaluation
Authors:
Po-Sen Huang,
Huan Zhang,
Ray Jiang,
Robert Stanforth,
Johannes Welbl,
Jack Rae,
Vishal Maini,
Dani Yogatama,
Pushmeet Kohli
Abstract:
Advances in language modeling architectures and the availability of large text corpora have driven progress in automatic text generation. While this results in models capable of generating coherent texts, it also prompts models to internalize social biases present in the training corpus. This paper aims to quantify and reduce a particular type of bias exhibited by language models: bias in the sent…
▽ More
Advances in language modeling architectures and the availability of large text corpora have driven progress in automatic text generation. While this results in models capable of generating coherent texts, it also prompts models to internalize social biases present in the training corpus. This paper aims to quantify and reduce a particular type of bias exhibited by language models: bias in the sentiment of generated text. Given a conditioning context (e.g., a writing prompt) and a language model, we analyze if (and how) the sentiment of the generated text is affected by changes in values of sensitive attributes (e.g., country names, occupations, genders) in the conditioning context using a form of counterfactual evaluation. We quantify sentiment bias by adopting individual and group fairness metrics from the fair machine learning literature, and demonstrate that large-scale models trained on two different corpora (news articles, and Wikipedia) exhibit considerable levels of bias. We then propose embedding and sentiment prediction-derived regularization on the language model's latent representations. The regularizations improve fairness metrics while retaining comparable levels of perplexity and semantic similarity.
△ Less
Submitted 8 October, 2020; v1 submitted 8 November, 2019;
originally announced November 2019.
-
Making sense of sensory input
Authors:
Richard Evans,
Jose Hernandez-Orallo,
Johannes Welbl,
Pushmeet Kohli,
Marek Sergot
Abstract:
This paper attempts to answer a central question in unsupervised learning: what does it mean to "make sense" of a sensory sequence? In our formalization, making sense involves constructing a symbolic causal theory that both explains the sensory sequence and also satisfies a set of unity conditions. The unity conditions insist that the constituents of the causal theory -- objects, properties, and l…
▽ More
This paper attempts to answer a central question in unsupervised learning: what does it mean to "make sense" of a sensory sequence? In our formalization, making sense involves constructing a symbolic causal theory that both explains the sensory sequence and also satisfies a set of unity conditions. The unity conditions insist that the constituents of the causal theory -- objects, properties, and laws -- must be integrated into a coherent whole. On our account, making sense of sensory input is a type of program synthesis, but it is unsupervised program synthesis.
Our second contribution is a computer implementation, the Apperception Engine, that was designed to satisfy the above requirements. Our system is able to produce interpretable human-readable causal theories from very small amounts of data, because of the strong inductive bias provided by the unity conditions. A causal theory produced by our system is able to predict future sensor readings, as well as retrodict earlier readings, and impute (fill in the blanks of) missing sensory readings, in any combination.
We tested the engine in a diverse variety of domains, including cellular automata, rhythms and simple nursery tunes, multi-modal binding problems, occlusion tasks, and sequence induction intelligence tests. In each domain, we test our engine's ability to predict future sensor values, retrodict earlier sensor values, and impute missing sensory data. The engine performs well in all these domains, significantly out-performing neural net baselines. We note in particular that in the sequence induction intelligence tests, our system achieved human-level performance. This is notable because our system is not a bespoke system designed specifically to solve intelligence tests, but a general-purpose system that was designed to make sense of any sensory sequence.
△ Less
Submitted 13 July, 2020; v1 submitted 5 October, 2019;
originally announced October 2019.
-
Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation
Authors:
Po-Sen Huang,
Robert Stanforth,
Johannes Welbl,
Chris Dyer,
Dani Yogatama,
Sven Gowal,
Krishnamurthy Dvijotham,
Pushmeet Kohli
Abstract:
Neural networks are part of many contemporary NLP systems, yet their empirical successes come at the price of vulnerability to adversarial attacks. Previous work has used adversarial training and data augmentation to partially mitigate such brittleness, but these are unlikely to find worst-case adversaries due to the complexity of the search space arising from discrete text perturbations. In this…
▽ More
Neural networks are part of many contemporary NLP systems, yet their empirical successes come at the price of vulnerability to adversarial attacks. Previous work has used adversarial training and data augmentation to partially mitigate such brittleness, but these are unlikely to find worst-case adversaries due to the complexity of the search space arising from discrete text perturbations. In this work, we approach the problem from the opposite direction: to formally verify a system's robustness against a predefined class of adversarial attacks. We study text classification under synonym replacements or character flip perturbations. We propose modeling these input perturbations as a simplex and then using Interval Bound Propagation -- a formal model verification method. We modify the conventional log-likelihood training objective to train models that can be efficiently verified, which would otherwise come with exponential search complexity. The resulting models show only little difference in terms of nominal accuracy, but have much improved verified accuracy under perturbations and come with an efficiently computable formal guarantee on worst case adversaries.
△ Less
Submitted 20 December, 2019; v1 submitted 3 September, 2019;
originally announced September 2019.
-
Jack the Reader - A Machine Reading Framework
Authors:
Dirk Weissenborn,
Pasquale Minervini,
Tim Dettmers,
Isabelle Augenstein,
Johannes Welbl,
Tim Rocktäschel,
Matko Bošnjak,
Jeff Mitchell,
Thomas Demeester,
Pontus Stenetorp,
Sebastian Riedel
Abstract:
Many Machine Reading and Natural Language Understanding tasks require reading supporting text in order to answer questions. For example, in Question Answering, the supporting text can be newswire or Wikipedia articles; in Natural Language Inference, premises can be seen as the supporting text and hypotheses as questions. Providing a set of useful primitives operating in a single framework of relat…
▽ More
Many Machine Reading and Natural Language Understanding tasks require reading supporting text in order to answer questions. For example, in Question Answering, the supporting text can be newswire or Wikipedia articles; in Natural Language Inference, premises can be seen as the supporting text and hypotheses as questions. Providing a set of useful primitives operating in a single framework of related tasks would allow for expressive modelling, and easier model comparison and replication. To that end, we present Jack the Reader (Jack), a framework for Machine Reading that allows for quick model prototyping by component reuse, evaluation of new models on existing datasets as well as integrating new datasets and applying them on a growing set of implemented baseline models. Jack is currently supporting (but not limited to) three tasks: Question Answering, Natural Language Inference, and Link Prediction. It is developed with the aim of increasing research efficiency and code reuse.
△ Less
Submitted 19 June, 2018;
originally announced June 2018.
-
Constructing Datasets for Multi-hop Reading Comprehension Across Documents
Authors:
Johannes Welbl,
Pontus Stenetorp,
Sebastian Riedel
Abstract:
Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently there exist no resources to train and test this capability. We propose a novel task to encourage the development of models for text…
▽ More
Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently there exist no resources to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence - effectively performing multi-hop (alias multi-step) inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced, and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that one can integrate information across documents. However, both models struggle to select relevant information, as providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 42.9% compared to human performance at 74.0% - leaving ample room for improvement.
△ Less
Submitted 11 June, 2018; v1 submitted 17 October, 2017;
originally announced October 2017.
-
Crowdsourcing Multiple Choice Science Questions
Authors:
Johannes Welbl,
Nelson F. Liu,
Matt Gardner
Abstract:
We present a novel method for obtaining high-quality, domain-targeted multiple choice questions from crowd workers. Generating these questions can be difficult without trading away originality, relevance or diversity in the answer options. Our method addresses these problems by leveraging a large corpus of domain-specific text and a small set of existing questions. It produces model suggestions fo…
▽ More
We present a novel method for obtaining high-quality, domain-targeted multiple choice questions from crowd workers. Generating these questions can be difficult without trading away originality, relevance or diversity in the answer options. Our method addresses these problems by leveraging a large corpus of domain-specific text and a small set of existing questions. It produces model suggestions for document selection and answer distractor choice which aid the human question generation process. With this method we have assembled SciQ, a dataset of 13.7K multiple choice science exam questions (Dataset available at http://allenai.org/data.html). We demonstrate that the method produces in-domain questions by providing an analysis of this new dataset and by showing that humans cannot distinguish the crowdsourced questions from original questions. When using SciQ as additional training data to existing questions, we observe accuracy improvements on real science exams.
△ Less
Submitted 19 July, 2017;
originally announced July 2017.
-
Knowledge Graph Completion via Complex Tensor Factorization
Authors:
Théo Trouillon,
Christopher R. Dance,
Johannes Welbl,
Sebastian Riedel,
Éric Gaussier,
Guillaume Bouchard
Abstract:
In statistical relational learning, knowledge graph completion deals with automatically understanding the structure of large knowledge graphs---labeled directed graphs---and predicting missing relationships---labeled edges. State-of-the-art embedding models propose different trade-offs between modeling expressiveness, and time and space complexity. We reconcile both expressiveness and complexity t…
▽ More
In statistical relational learning, knowledge graph completion deals with automatically understanding the structure of large knowledge graphs---labeled directed graphs---and predicting missing relationships---labeled edges. State-of-the-art embedding models propose different trade-offs between modeling expressiveness, and time and space complexity. We reconcile both expressiveness and complexity through the use of complex-valued embeddings and explore the link between such complex-valued embeddings and unitary diagonalization. We corroborate our approach theoretically and show that all real square matrices---thus all possible relation/adjacency matrices---are the real part of some unitarily diagonalizable matrix. This results opens the door to a lot of other applications of square matrices factorization. Our approach based on complex embeddings is arguably simple, as it only involves a Hermitian dot product, the complex counterpart of the standard dot product between real vectors, whereas other methods resort to more and more complicated composition functions to increase their expressiveness. The proposed complex embeddings are scalable to large data sets as it remains linear in both space and time, while consistently outperforming alternative approaches on standard link prediction benchmarks.
△ Less
Submitted 26 November, 2017; v1 submitted 22 February, 2017;
originally announced February 2017.
-
Frustratingly Short Attention Spans in Neural Language Modeling
Authors:
Michał Daniluk,
Tim Rocktäschel,
Johannes Welbl,
Sebastian Riedel
Abstract:
Neural language models predict the next token using a latent representation of the immediate token history. Recently, various methods for augmenting neural language models with an attention mechanism over a differentiable memory have been proposed. For predicting the next token, these models query information from a memory of the recent history which can facilitate learning mid- and long-range dep…
▽ More
Neural language models predict the next token using a latent representation of the immediate token history. Recently, various methods for augmenting neural language models with an attention mechanism over a differentiable memory have been proposed. For predicting the next token, these models query information from a memory of the recent history which can facilitate learning mid- and long-range dependencies. However, conventional attention mechanisms used in memory-augmented neural language models produce a single output vector per time step. This vector is used both for predicting the next token as well as for the key and value of a differentiable memory of a token history. In this paper, we propose a neural language model with a key-value attention mechanism that outputs separate representations for the key and value of a differentiable memory, as well as for encoding the next-word distribution. This model outperforms existing memory-augmented neural language models on two corpora. Yet, we found that our method mainly utilizes a memory of the five most recent output representations. This led to the unexpected main finding that a much simpler model based only on the concatenation of recent output representations from previous time steps is on par with more sophisticated memory-augmented neural language models.
△ Less
Submitted 15 February, 2017;
originally announced February 2017.
-
Complex Embeddings for Simple Link Prediction
Authors:
Théo Trouillon,
Johannes Welbl,
Sebastian Riedel,
Éric Gaussier,
Guillaume Bouchard
Abstract:
In statistical relational learning, the link prediction problem is key to automatically understand the structure of large knowledge bases. As in previous studies, we propose to solve this problem through latent factorization. However, here we make use of complex valued embeddings. The composition of complex embeddings can handle a large variety of binary relations, among them symmetric and antisym…
▽ More
In statistical relational learning, the link prediction problem is key to automatically understand the structure of large knowledge bases. As in previous studies, we propose to solve this problem through latent factorization. However, here we make use of complex valued embeddings. The composition of complex embeddings can handle a large variety of binary relations, among them symmetric and antisymmetric relations. Compared to state-of-the-art models such as Neural Tensor Network and Holographic Embeddings, our approach based on complex embeddings is arguably simpler, as it only uses the Hermitian dot product, the complex counterpart of the standard dot product between real vectors. Our approach is scalable to large datasets as it remains linear in both space and time, while consistently outperforming alternative approaches on standard link prediction benchmarks.
△ Less
Submitted 20 June, 2016;
originally announced June 2016.
-
Neural Random Forests
Authors:
Gérard Biau,
Erwan Scornet,
Johannes Welbl
Abstract:
Given an ensemble of randomized regression trees, it is possible to restructure them as a collection of multilayered neural networks with particular connection weights. Following this principle, we reformulate the random forest method of Breiman (2001) into a neural network setting, and in turn propose two new hybrid procedures that we call neural random forests. Both predictors exploit prior know…
▽ More
Given an ensemble of randomized regression trees, it is possible to restructure them as a collection of multilayered neural networks with particular connection weights. Following this principle, we reformulate the random forest method of Breiman (2001) into a neural network setting, and in turn propose two new hybrid procedures that we call neural random forests. Both predictors exploit prior knowledge of regression trees for their architecture, have less parameters to tune than standard networks, and less restrictions on the geometry of the decision boundaries than trees. Consistency results are proved, and substantial numerical evidence is provided on both synthetic and real data sets to assess the excellent performance of our methods in a large variety of prediction problems.
△ Less
Submitted 3 April, 2018; v1 submitted 25 April, 2016;
originally announced April 2016.
-
A Factorization Machine Framework for Testing Bigram Embeddings in Knowledgebase Completion
Authors:
Johannes Welbl,
Guillaume Bouchard,
Sebastian Riedel
Abstract:
Embedding-based Knowledge Base Completion models have so far mostly combined distributed representations of individual entities or relations to compute truth scores of missing links. Facts can however also be represented using pairwise embeddings, i.e. embeddings for pairs of entities and relations. In this paper we explore such bigram embeddings with a flexible Factorization Machine model and sev…
▽ More
Embedding-based Knowledge Base Completion models have so far mostly combined distributed representations of individual entities or relations to compute truth scores of missing links. Facts can however also be represented using pairwise embeddings, i.e. embeddings for pairs of entities and relations. In this paper we explore such bigram embeddings with a flexible Factorization Machine model and several ablations from it. We investigate the relevance of various bigram types on the fb15k237 dataset and find relative improvements compared to a compositional model.
△ Less
Submitted 20 April, 2016;
originally announced April 2016.