Search | arXiv e-print repository

Construct accurate multi-continuum micromorphic homogenisations in multi-D space-time with computer algebra

Abstract: Homogenisation empowers the efficient macroscale system level prediction of physical problems with intricate microscale structures. Here we develop an innovative powerful, rigorous and flexible framework for asymptotic homogenisation of dynamics at the finite scale separation of real physics, with proven results underpinned by modern dynamical systems theory. The novel systematic approach removes… ▽ More Homogenisation empowers the efficient macroscale system level prediction of physical problems with intricate microscale structures. Here we develop an innovative powerful, rigorous and flexible framework for asymptotic homogenisation of dynamics at the finite scale separation of real physics, with proven results underpinned by modern dynamical systems theory. The novel systematic approach removes most of the usual assumptions, whether implicit or explicit, of other methodologies. By no longer assuming averages the methodology constructs so-called multi-continuum or micromorphic homogenisations systematically based upon the microscale physics. The developed framework and approach enables a user to straightforwardly choose and create such homogenisations with clear physical and theoretical support, and of highly controllable accuracy and fidelity. △ Less

Submitted 3 July, 2024; originally announced July 2024.

MSC Class: 35B27; 74H10; 37L10; 35B40

arXiv:2404.03626 [pdf, other]

Training LLMs over Neurally Compressed Text

Authors: Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant

Abstract: In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier h… ▽ More In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier handling of long text spans. The main obstacle to this goal is that strong compression tends to produce opaque outputs that are not well-suited for learning. In particular, we find that text naïvely compressed via Arithmetic Coding is not readily learnable by LLMs. To overcome this, we propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. Using this method, we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. Shorter sequence lengths require fewer autoregressive generation steps, and reduce latency. Finally, we provide extensive analysis of the properties that contribute to learnability, and offer concrete suggestions for how to further improve the performance of high-compression tokenizers. △ Less

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2404.01413 [pdf, other]

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Authors: Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

Abstract: The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration… ▽ More The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of language models on text corpora. We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs. △ Less

Submitted 29 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.17887 [pdf, other]

The Unreasonable Ineffectiveness of the Deeper Layers

Authors: Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts

Abstract: We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage… ▽ More We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: 12 + 10 pages, 5 + 4 figures

Report number: MIT-CTP/5694

arXiv:2403.08295 [pdf, other]

Gemma: Open Models Based on Gemini Research and Technology

Authors: Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari , et al. (83 additional authors not shown)

Abstract: This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Ge… ▽ More This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations. △ Less

Submitted 16 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

arXiv:2402.01700 [pdf]

Question answering systems for health professionals at the point of care -- a systematic review

Authors: Gregory Kell, Angus Roberts, Serge Umansky, Linglong Qian, Davide Ferrari, Frank Soboczenski, Byron Wallace, Nikhil Patel, Iain J Marshall

Abstract: Objective: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement. Materials and method… ▽ More Objective: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement. Materials and methods: We searched PubMed, IEEE Xplore, ACM Digital Library, ACL Anthology and forward and backward citations on 7th February 2023. We included peer-reviewed journal and conference papers describing the design and evaluation of biomedical QA systems. Two reviewers screened titles, abstracts, and full-text articles. We conducted a narrative synthesis and risk of bias assessment for each study. We assessed the utility of biomedical QA systems. Results: We included 79 studies and identified themes, including question realism, answer reliability, answer utility, clinical specialism, systems, usability, and evaluation methods. Clinicians' questions used to train and evaluate QA systems were restricted to certain sources, types and complexity levels. No system communicated confidence levels in the answers or sources. Many studies suffered from high risks of bias and applicability concerns. Only 8 studies completely satisfied any criterion for clinical utility, and only 7 reported user evaluations. Most systems were built with limited input from clinicians. Discussion: While machine learning methods have led to increased accuracy, most studies imperfectly reflected real-world healthcare information needs. Key research priorities include developing more realistic healthcare QA datasets and considering the reliability of answer sources, rather than merely focusing on accuracy. △ Less

Submitted 24 January, 2024; originally announced February 2024.

Comments: Accepted to the Journal of the American Medical Informatics Association (JAMIA)

arXiv:2401.04792 [pdf, other]

REACT: Autonomous Intrusion Response System for Intelligent Vehicles

Authors: Mohammad Hamad, Andreas Finkenzeller, Michael Kühr, Andrew Roberts, Olaf Maennel, Vassilis Prevelakis, Sebastian Steinhorst

Abstract: Autonomous and connected vehicles are rapidly evolving, integrating numerous technologies and software. This progress, however, has made them appealing targets for cybersecurity attacks. As the risk of cyber threats escalates with this advancement, the focus is shifting from solely preventing these attacks to also mitigating their impact. Current solutions rely on vehicle security operation center… ▽ More Autonomous and connected vehicles are rapidly evolving, integrating numerous technologies and software. This progress, however, has made them appealing targets for cybersecurity attacks. As the risk of cyber threats escalates with this advancement, the focus is shifting from solely preventing these attacks to also mitigating their impact. Current solutions rely on vehicle security operation centers, where attack information is analyzed before deciding on a response strategy. However, this process can be time-consuming and faces scalability challenges, along with other issues stemming from vehicle connectivity. This paper proposes a dynamic intrusion response system integrated within the vehicle. This system enables the vehicle to respond to a variety of incidents almost instantly, thereby reducing the need for interaction with the vehicle security operation center. The system offers a comprehensive list of potential responses, a methodology for response evaluation, and various response selection methods. The proposed solution was implemented on an embedded platform. Two distinct cyberattack use cases served as the basis for evaluating the system. The evaluation highlights the system's adaptability, its ability to respond swiftly, its minimal memory footprint, and its capacity for dynamic system parameter adjustments. The proposed solution underscores the necessity and feasibility of incorporating dynamic response mechanisms in smart vehicles. This is a crucial factor in ensuring the safety and resilience of future smart mobility. △ Less

Submitted 16 January, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

Comments: 20 pages

arXiv:2311.10912 [pdf, ps, other]

Opportunities in Mental Health Support for Informal Dementia Caregivers Suffering from Verbal Agitation

Authors: Taewook Kim, Hyeok Kim, Angela Roberts, Maia Jacobs, Matthew Kay

Abstract: People with dementia (PwD) often present verbal agitation such as cursing, screaming, and persistently complaining. Verbal agitation can impose mental distress on informal caregivers (e.g., family, friends), which may cause severe mental illnesses, such as depression and anxiety disorders. To improve informal caregivers' mental health, we explore design opportunities by interviewing 11 informal ca… ▽ More People with dementia (PwD) often present verbal agitation such as cursing, screaming, and persistently complaining. Verbal agitation can impose mental distress on informal caregivers (e.g., family, friends), which may cause severe mental illnesses, such as depression and anxiety disorders. To improve informal caregivers' mental health, we explore design opportunities by interviewing 11 informal caregivers suffering from verbal agitation of PwD. In particular, we first characterize how the predictability of verbal agitation impacts informal caregivers' mental health and how caregivers' coping strategies vary before, during, and after verbal agitation. Based on our findings, we propose design opportunities to improve the mental health of informal caregivers suffering from verbal agitation: distracting PwD (in-situ support; before), prompting just-in-time maneuvers (information support; during), and comfort and education (social & information support; after). We discuss our reflections on cultural disparities between participants. Our work envisions a broader design space for supporting informal caregivers' well-being and describes when and how that support could be provided. △ Less

Submitted 17 November, 2023; originally announced November 2023.

Comments: 26 pages, 1 figure, 2 tables. Accepted to PACM HCI (CSCW 2024)

arXiv:2311.06477 [pdf, other]

Report of the 1st Workshop on Generative AI and Law

Authors: A. Feder Cooper, Katherine Lee, James Grimmelmann, Daphne Ippolito, Christopher Callison-Burch, Christopher A. Choquette-Choo, Niloofar Mireshghallah, Miles Brundage, David Mimno, Madiha Zahrah Choksi, Jack M. Balkin, Nicholas Carlini, Christopher De Sa, Jonathan Frankle, Deep Ganguli, Bryant Gipson, Andres Guadamuz, Swee Leng Harris, Abigail Z. Jacobs, Elizabeth Joh, Gautam Kamath, Mark Lemley, Cass Matthews, Christine McLeavey, Corynne McSherry , et al. (10 additional authors not shown)

Abstract: This report presents the takeaways of the inaugural Workshop on Generative AI and Law (GenLaw), held in July 2023. A cross-disciplinary group of practitioners and scholars from computer science and law convened to discuss the technical, doctrinal, and policy challenges presented by law for Generative AI, and by Generative AI for law, with an emphasis on U.S. law in particular. We begin the report… ▽ More This report presents the takeaways of the inaugural Workshop on Generative AI and Law (GenLaw), held in July 2023. A cross-disciplinary group of practitioners and scholars from computer science and law convened to discuss the technical, doctrinal, and policy challenges presented by law for Generative AI, and by Generative AI for law, with an emphasis on U.S. law in particular. We begin the report with a high-level statement about why Generative AI is both immensely significant and immensely challenging for law. To meet these challenges, we conclude that there is an essential need for 1) a shared knowledge base that provides a common conceptual language for experts across disciplines; 2) clarification of the distinctive technical capabilities of generative-AI systems, as compared and contrasted to other computer and AI systems; 3) a logical taxonomy of the legal issues these systems raise; and, 4) a concrete research agenda to promote collaboration and knowledge-sharing on emerging issues at the intersection of Generative AI and law. In this report, we synthesize the key takeaways from the GenLaw workshop that begin to address these needs. All of the listed authors contributed to the workshop upon which this report is based, but they and their organizations do not necessarily endorse all of the specific claims in this report. △ Less

Submitted 2 December, 2023; v1 submitted 10 November, 2023; originally announced November 2023.

arXiv:2310.07765 [pdf, other]

Feature Learning and Generalization in Deep Networks with Orthogonal Weights

Authors: Hannah Day, Yonatan Kahn, Daniel A. Roberts

Abstract: Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We s… ▽ More Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed in deep networks with depth comparable to width. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks. △ Less

Submitted 12 June, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: v2: numerical experiments updated with more data, plots updated to match, conclusions unchanged. 30+12 pages, 20 figures

Report number: MIT-CTP/5625

arXiv:2309.02237 [pdf]

Sample Size in Natural Language Processing within Healthcare Research

Authors: Jaya Chaturvedi, Diana Shamsutdinova, Felix Zimmer, Sumithra Velupillai, Daniel Stahl, Robert Stewart, Angus Roberts

Abstract: Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods, such as natural language processing, where free-text is used to generate predictions and classify instances of text. W… ▽ More Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods, such as natural language processing, where free-text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain. Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, the most common diagnosis code in the database. Simulations were performed using various classifiers on different sample sizes and class proportions. This was repeated for a comparatively less common diagnosis code within the database of diabetes mellitus without mention of complication. Smaller sample sizes resulted in better results when using a K-nearest neighbours classifier, whereas larger sample sizes provided better results with support vector machines and BERT models. Overall, a sample size larger than 1000 was sufficient to provide decent performance metrics. The simulations conducted within this study provide guidelines that can be used as recommendations for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual healthcare data. The methodology used here can be modified for sample size estimates calculations with other datasets. △ Less

Submitted 5 September, 2023; originally announced September 2023.

Comments: Submitted to Journal of Biomedical Informatics

arXiv:2308.09226 [pdf, other]

Efficient computational homogenisation of 2D beams of heterogeneous elasticity using the patch scheme

Authors: Thien Tran-Duc, J. E. Bunder, A. J. Roberts

Abstract: Modern 'smart' materials have complex heterogeneous microscale structure, often with unknown macroscale closure but one we need to realise for large scale engineering and science. The multiscale Equation-Free Patch Scheme empowers us to non-intrusively, efficiently, and accurately predict the large scale, system level, solutions through computations on only small sparse patches of the given detail… ▽ More Modern 'smart' materials have complex heterogeneous microscale structure, often with unknown macroscale closure but one we need to realise for large scale engineering and science. The multiscale Equation-Free Patch Scheme empowers us to non-intrusively, efficiently, and accurately predict the large scale, system level, solutions through computations on only small sparse patches of the given detailed microscale system. Here the microscale system is that of a 2D beam of heterogeneous elasticity, with either fixed fixed, fixed-free, or periodic boundary conditions. We demonstrate that the described multiscale Patch Scheme simply, efficiently, and stably predicts the beam's macroscale, with a controllable accuracy, at finite scale separation. Dynamical systems theory supports the scheme. This article points the way for others to use this systematic non-intrusive approach, via a developing toolbox of functions, to model and compute accurately macroscale system-levels of general complex physical and engineering systems. △ Less

Submitted 17 August, 2023; originally announced August 2023.

arXiv:2308.08904 [pdf]

Development of a Knowledge Graph Embeddings Model for Pain

Authors: Jaya Chaturvedi, Tao Wang, Sumithra Velupillai, Robert Stewart, Angus Roberts

Abstract: Pain is a complex concept that can interconnect with other concepts such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has… ▽ More Pain is a complex concept that can interconnect with other concepts such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has been recorded in electronic health records. Knowledge graphs represent concepts and their relations by an interlinked network, enabling semantic and context-based reasoning in a computationally tractable form. These graphs can, however, be too large for efficient computation. Knowledge graph embeddings help to resolve this by representing the graphs in a low-dimensional vector space. These embeddings can then be used in various downstream tasks such as classification and link prediction. The various relations associated with pain which are required to construct such a knowledge graph can be obtained from external medical knowledge bases such as SNOMED CT, a hierarchical systematic nomenclature of medical terms. A knowledge graph built in this way could be further enriched with real-world examples of pain and its relations extracted from electronic health records. This paper describes the construction of such knowledge graph embedding models of pain concepts, extracted from the unstructured text of mental health electronic health records, combined with external knowledge created from relations described in SNOMED CT, and their evaluation on a subject-object link prediction task. The performance of the models was compared with other baseline models. △ Less

Submitted 17 August, 2023; originally announced August 2023.

Comments: Accepted at AMIA 2023, New Orleans

arXiv:2305.13169 [pdf, other]

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Authors: Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito

Abstract: Pretraining is the preliminary and fundamental step in developing capable language models (LM). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with di… ▽ More Pretraining is the preliminary and fundamental step in developing capable language models (LM). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate there does not exist a one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development. △ Less

Submitted 13 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

arXiv:2304.09151 [pdf, other]

UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining

Authors: Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, Orhan Firat

Abstract: Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mit… ▽ More Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling. △ Less

Submitted 18 April, 2023; originally announced April 2023.

arXiv:2304.01240 [pdf]

Identifying Mentions of Pain in Mental Health Records Text: A Natural Language Processing Approach

Authors: Jaya Chaturvedi, Sumithra Velupillai, Robert Stewart, Angus Roberts

Abstract: Pain is a common reason for accessing healthcare resources and is a growing area of research, especially in its overlap with mental health. Mental health electronic health records are a good data source to study this overlap. However, much information on pain is held in the free text of these records, where mentions of pain present a unique natural language processing problem due to its ambiguous… ▽ More Pain is a common reason for accessing healthcare resources and is a growing area of research, especially in its overlap with mental health. Mental health electronic health records are a good data source to study this overlap. However, much information on pain is held in the free text of these records, where mentions of pain present a unique natural language processing problem due to its ambiguous nature. This project uses data from an anonymised mental health electronic health records database. The data are used to train a machine learning based classification algorithm to classify sentences as discussing patient pain or not. This will facilitate the extraction of relevant pain information from large databases, and the use of such outputs for further studies on pain and mental health. 1,985 documents were manually triple-annotated for creation of gold standard training data, which was used to train three commonly used classification algorithms. The best performing model achieved an F1-score of 0.98 (95% CI 0.98-0.99). △ Less

Submitted 5 April, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: 5 pages, 2 tables, submitted to MEDINFO 2023 conference

arXiv:2301.13688 [pdf, other]

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Authors: Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, Adam Roberts

Abstract: We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniqu… ▽ More We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2. △ Less

Submitted 14 February, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

arXiv:2301.12662 [pdf, other]

SingSong: Generating musical accompaniments from singing

Authors: Chris Donahue, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, Jesse Engel

Abstract: We present SingSong, a system that generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus… ▽ More We present SingSong, a system that generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus of music audio to produce aligned pairs of vocals and instrumental sources. Then, we adapt AudioLM (Borsos et al., 2022) -- a state-of-the-art approach for unconditional audio generation -- to be suitable for conditional "audio-to-audio" generation tasks, and train it on the source-separated (vocal, instrumental) pairs. In a pairwise comparison with the same vocal inputs, listeners expressed a significant preference for instrumentals generated by SingSong compared to those from a strong retrieval baseline. Sound examples at https://g.co/magenta/singsong △ Less

Submitted 29 January, 2023; originally announced January 2023.

arXiv:2301.11325 [pdf, other]

MusicLM: Generating Music From Text

Authors: Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank

Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous s… ▽ More We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts. △ Less

Submitted 26 January, 2023; originally announced January 2023.

Comments: Supplementary material at https://google-research.github.io/seanet/musiclm/examples and https://kaggle.com/datasets/googleai/musiccaps

arXiv:2212.10562 [pdf, other]

Character-Aware Models Improve Visual Text Rendering

Authors: Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, Noah Constant

Abstract: Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word's visual makeup as a series of glyphs. To quantify this effect, we conduct a series of experiments comparing character-aware vs. character-blind text e… ▽ More Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word's visual makeup as a series of glyphs. To quantify this effect, we conduct a series of experiments comparing character-aware vs. character-blind text encoders. In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell). Applying our learnings to the visual domain, we train a suite of image generation models, and show that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks (our DrawText benchmark). Our models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors on rare words, despite training on far fewer examples. △ Less

Submitted 3 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

arXiv:2211.14218 [pdf, ps, other]

Shotgun assembly of random graphs

Authors: Tom Johnston, Gal Kronenberg, Alexander Roberts, Alex Scott

Abstract: In the graph shotgun assembly problem, we are given the balls of radius $r$ around each vertex of a graph and asked to reconstruct the graph. We study the shotgun assembly of the Erdős-Rényi random graph $\mathcal G(n,p)$ from a wide range of values of $r$. We determine the threshold for reconstructibility for each $r\geq 3$, extending and improving substantially on results of Mossel and Ross for… ▽ More In the graph shotgun assembly problem, we are given the balls of radius $r$ around each vertex of a graph and asked to reconstruct the graph. We study the shotgun assembly of the Erdős-Rényi random graph $\mathcal G(n,p)$ from a wide range of values of $r$. We determine the threshold for reconstructibility for each $r\geq 3$, extending and improving substantially on results of Mossel and Ross for $r=3$. For $r=2$, we give upper and lower bounds that improve on results of Gaudio and Mossel by polynomial factors. We also give a sharpening of a result of Huang and Tikhomirov for $r=1$. △ Less

Submitted 3 June, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

Comments: 36 pages, 3 figures

arXiv:2211.09760 [pdf, other]

VeLO: Training Versatile Learned Optimizers by Scaling Up

Authors: Luke Metz, James Harrison, C. Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, Jascha Sohl-Dickstein

Abstract: While deep learning models have replaced hand-designed features across many domains, these models are still trained with hand-designed optimizers. In this work, we leverage the same scaling approach behind the success of deep learning to learn versatile optimizers. We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates. M… ▽ More While deep learning models have replaced hand-designed features across many domains, these models are still trained with hand-designed optimizers. In this work, we leverage the same scaling approach behind the success of deep learning to learn versatile optimizers. We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates. Meta-trained with approximately four thousand TPU-months of compute on a wide variety of optimization tasks, our optimizer not only exhibits compelling performance, but optimizes in interesting and unexpected ways. It requires no hyperparameter tuning, instead automatically adapting to the specifics of the problem being optimized. We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive optimizer benchmark suite with baselines at velo-code.github.io. △ Less

Submitted 17 November, 2022; originally announced November 2022.

arXiv:2211.08411 [pdf, other]

Large Language Models Struggle to Learn Long-Tail Knowledge

Authors: Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, Colin Raffel

Abstract: The Internet contains a wealth of knowledge -- from the birthdays of historical figures to tutorials on how to code -- all of which may be learned by language models. However, while certain pieces of information are ubiquitous on the web, others appear extremely rarely. In this paper, we study the relationship between the knowledge memorized by large language models and the information in pre-trai… ▽ More The Internet contains a wealth of knowledge -- from the birthdays of historical figures to tutorials on how to code -- all of which may be learned by language models. However, while certain pieces of information are ubiquitous on the web, others appear extremely rarely. In this paper, we study the relationship between the knowledge memorized by large language models and the information in pre-training datasets scraped from the web. In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive QA performance on questions with little support in the pre-training data. Finally, we show that retrieval-augmentation can reduce the dependence on relevant pre-training information, presenting a promising approach for capturing the long-tail. △ Less

Submitted 27 July, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: ICML 2023 Camera Ready Version

arXiv:2211.05100 [pdf, other]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Authors: BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major , et al. (369 additional authors not shown)

Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access… ▽ More Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License. △ Less

Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2211.01786 [pdf, other]

Crosslingual Generalization through Multitask Finetuning

Authors: Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel

Abstract: Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks wi… ▽ More Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are freely available at https://github.com/bigscience-workshop/xmtf. △ Less

Submitted 29 May, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: 9 main pages (119 with appendix), 16 figures and 11 tables

arXiv:2210.16859 [pdf, other]

A Solvable Model of Neural Scaling Laws

Authors: Alexander Maloney, Daniel A. Roberts, James Sully

Abstract: Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such sca… ▽ More Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down and how they behave when they do. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss and how the finite extent of the data's spectral power law causes the model's performance to plateau. △ Less

Submitted 30 October, 2022; originally announced October 2022.

Comments: 73 + 23 pages, 14 + 5 figures

Report number: MIT-CTP/5463

arXiv:2210.11416 [pdf, other]

Scaling Instruction-Finetuned Language Models

Authors: Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang , et al. (10 additional authors not shown)

Abstract: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects d… ▽ More Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PALM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models. △ Less

Submitted 6 December, 2022; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: Public checkpoints: https://huggingface.co/docs/transformers/model_doc/flan-t5

arXiv:2206.14639 [pdf, other]

DDKtor: Automatic Diadochokinetic Speech Analysis

Authors: Yael Segal, Kasia Hitczenko, Matthew Goldrick, Adam Buchwald, Angela Roberts, Joseph Keshet

Abstract: Diadochokinetic speech tasks (DDK), in which participants repeatedly produce syllables, are commonly used as part of the assessment of speech motor impairments. These studies rely on manual analyses that are time-intensive, subjective, and provide only a coarse-grained picture of speech. This paper presents two deep neural network models that automatically segment consonants and vowels from unanno… ▽ More Diadochokinetic speech tasks (DDK), in which participants repeatedly produce syllables, are commonly used as part of the assessment of speech motor impairments. These studies rely on manual analyses that are time-intensive, subjective, and provide only a coarse-grained picture of speech. This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech. Both models work on the raw waveform and use convolutional layers for feature extraction. The first model is based on an LSTM classifier followed by fully connected layers, while the second model adds more convolutional layers followed by fully connected layers. These segmentations predicted by the models are used to obtain measures of speech rate and sound duration. Results on a young healthy individuals dataset show that our LSTM model outperforms the current state-of-the-art systems and performs comparably to trained human annotators. Moreover, the LSTM model also presents comparable results to trained human annotators when evaluated on unseen older individuals with Parkinson's Disease dataset. △ Less

Submitted 29 June, 2022; originally announced June 2022.

Comments: Accepted to Interspeech 2022

arXiv:2206.05408 [pdf, other]

Multi-instrument Music Synthesis with Spectrogram Diffusion

Authors: Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, Jesse Engel

Abstract: An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generat… ▽ More An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes. △ Less

Submitted 12 December, 2022; v1 submitted 10 June, 2022; originally announced June 2022.

arXiv:2204.05832 [pdf, other]

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Authors: Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel

Abstract: Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-sc… ▽ More Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. We find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models, using autoregressive language modeling as a downstream task. Furthermore, we find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models, ultimately achieving competitive performance after multitask finetuning. Code and checkpoints are available at https://github.com/bigscience-workshop/architecture-objective. △ Less

Submitted 12 April, 2022; originally announced April 2022.

arXiv:2204.02311 [pdf, other]

PaLM: Scaling Language Modeling with Pathways

Authors: Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin , et al. (42 additional authors not shown)

Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Tran… ▽ More Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. △ Less

Submitted 5 October, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

arXiv:2203.17189 [pdf, other]

Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$

Authors: Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen , et al. (18 additional authors not shown)

Abstract: Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we presen… ▽ More Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we present two software libraries that ease these issues: $\texttt{t5x}$ simplifies the process of building and training large language models at scale while maintaining ease of use, and $\texttt{seqio}$ provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data. Along with the libraries, we release configurations and instructions for T5-like encoder-decoder models as well as GPT-like decoder-only architectures. $\texttt{t5x}$ and $\texttt{seqio}$ are open source and available at https://github.com/google-research/t5x and https://github.com/google/seqio, respectively. △ Less

Submitted 31 March, 2022; originally announced March 2022.

arXiv:2201.08239 [pdf, other]

LaMDA: Language Models for Dialog Applications

Authors: Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao , et al. (35 additional authors not shown)

Abstract: We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotat… ▽ More We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency. △ Less

Submitted 10 February, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

arXiv:2109.13815 [pdf, other]

doi 10.21437/Interspeech.2021-688

Articulatory Coordination for Speech Motor Tracking in Huntington Disease

Authors: Matthew Perez, Amrit Romana, Angela Roberts, Noelle Carlozzi, Jennifer Ann Miner, Praveen Dayalu, Emily Mower Provost

Abstract: Huntington Disease (HD) is a progressive disorder which often manifests in motor impairment. Motor severity (captured via motor score) is a key component in assessing overall HD severity. However, motor score evaluation involves in-clinic visits with a trained medical professional, which are expensive and not always accessible. Speech analysis provides an attractive avenue for tracking HD severity… ▽ More Huntington Disease (HD) is a progressive disorder which often manifests in motor impairment. Motor severity (captured via motor score) is a key component in assessing overall HD severity. However, motor score evaluation involves in-clinic visits with a trained medical professional, which are expensive and not always accessible. Speech analysis provides an attractive avenue for tracking HD severity because speech is easy to collect remotely and provides insight into motor changes. HD speech is typically characterized as having irregular articulation. With this in mind, acoustic features that can capture vocal tract movement and articulatory coordination are particularly promising for characterizing motor symptom progression in HD. In this paper, we present an experiment that uses Vocal Tract Coordination (VTC) features extracted from read speech to estimate a motor score. When using an elastic-net regression model, we find that VTC features significantly outperform other acoustic features across varied-length audio segments, which highlights the effectiveness of these features for both short- and long-form reading tasks. Lastly, we analyze the F-value scores of VTC features to visualize which channels are most related to motor score. This work enables future research efforts to consider VTC features for acoustic analyses which target HD motor symptomatology tracking. △ Less

Submitted 28 September, 2021; originally announced September 2021.

arXiv:2109.00908 [pdf, ps, other]

doi 10.3934/amc.2022021

Binary self-dual codes of various lengths with new weight enumerators from a modified bordered construction and neighbours

Authors: Joe Gildea, Adrian Korban, Adam Michael Roberts, Alexander Tylyshchak

Abstract: In this work, we define a modification of a bordered construction for self-dual codes which utilises $λ$-circulant matrices. We provide the necessary conditions for the construction to produce self-dual codes over finite commutative Frobenius rings of characteristic 2. Using the modified construction together with the neighbour construction, we construct many binary self-dual codes of lengths 54,… ▽ More In this work, we define a modification of a bordered construction for self-dual codes which utilises $λ$-circulant matrices. We provide the necessary conditions for the construction to produce self-dual codes over finite commutative Frobenius rings of characteristic 2. Using the modified construction together with the neighbour construction, we construct many binary self-dual codes of lengths 54, 68, 82 and 94 with weight enumerators that have previously not been known to exist. △ Less

Submitted 2 September, 2021; originally announced September 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2108.09184, arXiv:2106.12355, arXiv:2102.10354

MSC Class: 94B05; 15B10; 15B33

arXiv:2108.09184 [pdf, ps, other]

doi 10.1016/j.disc.2023.113425

New binary self-dual codes of lengths 56, 62, 78, 92 and 94 from a bordered construction

Authors: Joe Gildea, Adrian Korban, Adam Michael Roberts, Alexander Tylyshchak

Abstract: In this paper, we present a new bordered construction for self-dual codes which employs $λ$-circulant matrices. We give the necessary conditions for our construction to produce self-dual codes over a finite commutative Frobenius ring of characteristic 2. Moreover, using our bordered construction together with the well-known building-up and neighbour methods, we construct many binary self-dual code… ▽ More In this paper, we present a new bordered construction for self-dual codes which employs $λ$-circulant matrices. We give the necessary conditions for our construction to produce self-dual codes over a finite commutative Frobenius ring of characteristic 2. Moreover, using our bordered construction together with the well-known building-up and neighbour methods, we construct many binary self-dual codes of lengths 56, 62, 78, 92 and 94 with parameters in their weight enumerators that were not known in the literature before. △ Less

Submitted 3 February, 2022; v1 submitted 20 August, 2021; originally announced August 2021.

Comments: corrected typos; other minor corrections. arXiv admin note: substantial text overlap with arXiv:2102.10354, arXiv:2106.12355, arXiv:2102.12326

MSC Class: 94B05; 15B10; 15B33

arXiv:2108.05056 [pdf, ps, other]

Group LCD and Group Reversible LCD Codes

Authors: Steven T. Dougherty, Joe Gildea, Adrian Korban, Adam M. Roberts

Abstract: In this paper, we give a new method for constructing LCD codes. We employ group rings and a well known map that sends group ring elements to a subring of the $n \times n$ matrices to obtain LCD codes. Our construction method guarantees that our LCD codes are also group codes, namely, the codes are ideals in a group ring. We show that with a certain condition on the group ring element $v,$ one can… ▽ More In this paper, we give a new method for constructing LCD codes. We employ group rings and a well known map that sends group ring elements to a subring of the $n \times n$ matrices to obtain LCD codes. Our construction method guarantees that our LCD codes are also group codes, namely, the codes are ideals in a group ring. We show that with a certain condition on the group ring element $v,$ one can construct non-trivial group LCD codes. Moreover, we also show that by adding more constraints on the group ring element $v,$ one can construct group LCD codes that are reversible. We present many examples of binary group LCD codes of which some are optimal and group reversible LCD codes with different parameters. △ Less

Submitted 11 August, 2021; originally announced August 2021.

Comments: 17 pages

MSC Class: 94B05

arXiv:2106.12355 [pdf, ps, other]

doi 10.1007/s10623-021-00976-3

New binary self-dual codes of lengths 80, 84 and 96 from composite matrices

Authors: Joe Gildea, Adrian Korban, Adam Michael Roberts

Abstract: In this work, we apply the idea of composite matrices arising from group rings to derive a number of different techniques for constructing self-dual codes over finite commutative Frobenius rings. By applying these techniques over different alphabets, we construct best known singly-even binary self-dual codes of lengths 80, 84 and 96 as well as doubly-even binary self-dual codes of length 96 that w… ▽ More In this work, we apply the idea of composite matrices arising from group rings to derive a number of different techniques for constructing self-dual codes over finite commutative Frobenius rings. By applying these techniques over different alphabets, we construct best known singly-even binary self-dual codes of lengths 80, 84 and 96 as well as doubly-even binary self-dual codes of length 96 that were not known in the literature before. △ Less

Submitted 23 June, 2021; originally announced June 2021.

Comments: arXiv admin note: text overlap with arXiv:2102.10354

MSC Class: 94B05; 16S34; 15B10; 15B33

arXiv:2106.10165 [pdf, other]

doi 10.1017/9781009023405

The Principles of Deep Learning Theory

Authors: Daniel A. Roberts, Sho Yaida, Boris Hanin

Abstract: This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are… ▽ More This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers. △ Less

Submitted 24 August, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

Comments: 471 pages, to be published by Cambridge University Press; v2: hyperlinks fixed, index added

Report number: MIT-CTP/5306

Journal ref: Cambridge University Press (2022)

arXiv:2105.13626 [pdf, other]

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

Abstract: Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pi… ▽ More Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments. △ Less

Submitted 7 March, 2022; v1 submitted 28 May, 2021; originally announced May 2021.

Comments: To be published in TACL 2022

arXiv:2104.13554 [pdf, other]

Mesoscale simulation of woven composite design decisions

Authors: Lincoln N. Collins, Scott A. Roberts

Abstract: Characterizing the connection between material design decisions/parameters and their effective properties allows for accelerated materials development and optimization. We present a global sensitivity analysis of woven composite thermophysical properties, including density, volume fraction, thermal conductivity, specific heat, moduli, permeability, and tortuosity, predicted using mesoscale finite… ▽ More Characterizing the connection between material design decisions/parameters and their effective properties allows for accelerated materials development and optimization. We present a global sensitivity analysis of woven composite thermophysical properties, including density, volume fraction, thermal conductivity, specific heat, moduli, permeability, and tortuosity, predicted using mesoscale finite element simulations. The mesoscale simulations use microscale approximations for the tow and matrix phases. We performed Latin hypercube sampling of viable input parameter ranges, and the resulting effective property distributions are analyzed using a surrogate model to determine the correlations between material parameters and responses, interactions between properties, and finally Sobol' indices and sensitivities. We demonstrate that both constituent physical properties and the mesoscale geometry strongly influence the composite material properties. △ Less

Submitted 27 April, 2021; originally announced April 2021.

arXiv:2104.04874 [pdf, ps, other]

SGD Implicitly Regularizes Generalization Error

Authors: Daniel A. Roberts

Abstract: We derive a simple and model-independent formula for the change in the generalization gap due to a gradient descent update. We then compare the change in the test error for stochastic gradient descent to the change in test error from an equivalent number of gradient descent updates and show explicitly that stochastic gradient descent acts to regularize generalization error by decorrelating nearby… ▽ More We derive a simple and model-independent formula for the change in the generalization gap due to a gradient descent update. We then compare the change in the test error for stochastic gradient descent to the change in test error from an equivalent number of gradient descent updates and show explicitly that stochastic gradient descent acts to regularize generalization error by decorrelating nearby updates. These calculations depends on the details of the model only through the mean and covariance of the gradient distribution, which may be readily measured for particular models of interest. We discuss further improvements to these calculations and comment on possible implications for stochastic optimization. △ Less

Submitted 10 April, 2021; originally announced April 2021.

Comments: First appeared at the "Workshop on Integration of Deep Learning Theories" at NeurIPS in 2018 and has been available since then at https://research.fb.com/publications/sgd-implicitly-regularizes-generalization-error/

arXiv:2104.00008 [pdf, other]

Why is AI hard and Physics simple?

Authors: Daniel A. Roberts

Abstract: We discuss why AI is hard and why physics is simple. We discuss how physical intuition and the approach of theoretical physics can be brought to bear on the field of artificial intelligence and specifically machine learning. We suggest that the underlying project of machine learning and the underlying project of physics are strongly coupled through the principle of sparsity, and we call upon theor… ▽ More We discuss why AI is hard and why physics is simple. We discuss how physical intuition and the approach of theoretical physics can be brought to bear on the field of artificial intelligence and specifically machine learning. We suggest that the underlying project of machine learning and the underlying project of physics are strongly coupled through the principle of sparsity, and we call upon theoretical physicists to work on AI as physicists. As a first step in that direction, we discuss an upcoming book on the principles of deep learning theory that attempts to realize this approach. △ Less

Submitted 31 March, 2021; originally announced April 2021.

Comments: written for a special issue of Machine Learning: Science and Technology as an invited perspective piece

Report number: MIT-CTP/5269

arXiv:2102.12326 [pdf, ps, other]

doi 10.1007/s00200-022-00589-w

Quaternary Hermitian self-dual codes of lengths 26, 32, 36, 38 and 40 from modifications of well-known circulant constructions

Authors: Adam Michael Roberts

Abstract: In this work, we give three new techniques for constructing Hermitian self-dual codes over commutative Frobenius rings with a non-trivial involutory automorphism using $λ$-circulant matrices. The new constructions are derived as modifications of various well-known circulant constructions of self-dual codes. Applying these constructions together with the building-up construction, we construct many… ▽ More In this work, we give three new techniques for constructing Hermitian self-dual codes over commutative Frobenius rings with a non-trivial involutory automorphism using $λ$-circulant matrices. The new constructions are derived as modifications of various well-known circulant constructions of self-dual codes. Applying these constructions together with the building-up construction, we construct many new best known quaternary Hermitian self-dual codes of lengths 26, 32, 36, 38 and 40. △ Less

Submitted 24 February, 2021; originally announced February 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2102.10354

arXiv:2102.11972 [pdf, other]

Do Transformer Modifications Transfer Across Implementations and Applications?

Authors: Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, Colin Raffel

Abstract: The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we f… ▽ More The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we find that most modifications do not meaningfully improve performance. Furthermore, most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes. We conjecture that performance improvements may strongly depend on implementation details and correspondingly make some recommendations for improving the generality of experimental results. △ Less

Submitted 10 September, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

Comments: To appear at EMNLP 2021 as a conference paper

arXiv:2102.10354 [pdf, ps, other]

doi 10.1016/j.ffa.2021.101876

New binary self-dual codes of lengths 56, 58, 64, 80 and 92 from a modification of the four circulant construction

Authors: Joe Gildea, Adrian Korban, Adam Michael Roberts

Abstract: In this work, we give a new technique for constructing self-dual codes over commutative Frobenius rings using $λ$-circulant matrices. The new construction was derived as a modification of the well-known four circulant construction of self-dual codes. Applying this technique together with the building-up construction, we construct singly-even binary self-dual codes of lengths 56, 58, 64, 80 and 92… ▽ More In this work, we give a new technique for constructing self-dual codes over commutative Frobenius rings using $λ$-circulant matrices. The new construction was derived as a modification of the well-known four circulant construction of self-dual codes. Applying this technique together with the building-up construction, we construct singly-even binary self-dual codes of lengths 56, 58, 64, 80 and 92 that were not known in the literature before. Singly-even self-dual codes of length 80 with $β\in\{2,4,5,6,8\}$ in their weight enumerators are constructed for the first time in the literature. △ Less

Submitted 23 June, 2021; v1 submitted 20 February, 2021; originally announced February 2021.

Comments: corrected typos; added references

MSC Class: 94B05; 15B10; 15B33

arXiv:2102.08380 [pdf, other]

doi 10.1007/JHEP04(2021)280

Topological Obstructions to Autoencoding

Authors: Joshua Batson, C. Grace Haaf, Yonatan Kahn, Daniel A. Roberts

Abstract: Autoencoders have been proposed as a powerful tool for model-independent anomaly detection in high-energy physics. The operating principle is that events which do not belong to the space of training data will be reconstructed poorly, thus flagging them as anomalies. We point out that in a variety of examples of interest, the connection between large reconstruction error and anomalies is not so cle… ▽ More Autoencoders have been proposed as a powerful tool for model-independent anomaly detection in high-energy physics. The operating principle is that events which do not belong to the space of training data will be reconstructed poorly, thus flagging them as anomalies. We point out that in a variety of examples of interest, the connection between large reconstruction error and anomalies is not so clear. In particular, for data sets with nontrivial topology, there will always be points that erroneously seem anomalous due to global issues. Conversely, neural networks typically have an inductive bias or prior to locally interpolate such that undersampled or rare events may be reconstructed with small error, despite actually being the desired anomalies. Taken together, these facts are in tension with the simple picture of the autoencoder as an anomaly detector. Using a series of illustrative low-dimensional examples, we show explicitly how the intrinsic and extrinsic topology of the dataset affects the behavior of an autoencoder and how this topology is manifested in the latent space representation during training. We ground this analysis in the discussion of a mock "bump hunt" in which the autoencoder fails to identify an anomalous "signal" for reasons tied to the intrinsic topology of $n$-particle phase space. △ Less

Submitted 3 May, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: 24 + 20 pages, 26 figures; no autoencoders were harmed in the making of this project. v2: JHEP published version

Report number: MIT-CTP/5264

Journal ref: JHEP04(2021)280

arXiv:2101.00133 [pdf, other]

NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned

Authors: Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini , et al. (28 additional authors not shown)

Abstract: We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage conte… ▽ More We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage contestants to explore the trade-off between storing retrieval corpora or the parameters of learned models. In this report, we describe the motivation and organization of the competition, review the best submissions, and analyze system predictions to inform a discussion of evaluation for open-domain QA. △ Less

Submitted 19 September, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

Comments: 26 pages; Published in Proceedings of Machine Learning Research (PMLR), NeurIPS 2020 Competition and Demonstration Track

arXiv:2012.09913 [pdf, other]

doi 10.1038/s41467-021-25493-8

Quantifying the unknown impact of segmentation uncertainty on image-based simulations

Authors: Michael C. Krygier, Tyler LaBonte, Carianne Martinez, Chance Norris, Krish Sharma, Lincoln N. Collins, Partha P. Mukherjee, Scott A. Roberts

Abstract: Image-based simulation, the use of 3D images to calculate physical quantities, fundamentally relies on image segmentation to create the computational geometry. However, this process introduces image segmentation uncertainty because there is a variety of different segmentation tools (both manual and machine-learning-based) that will each produce a unique and valid segmentation. First, we demonstrat… ▽ More Image-based simulation, the use of 3D images to calculate physical quantities, fundamentally relies on image segmentation to create the computational geometry. However, this process introduces image segmentation uncertainty because there is a variety of different segmentation tools (both manual and machine-learning-based) that will each produce a unique and valid segmentation. First, we demonstrate that these variations propagate into the physics simulations, compromising the resulting physics quantities. Second, we propose a general framework for rapidly quantifying segmentation uncertainty. Through the creation and sampling of segmentation uncertainty probability maps, we systematically and objectively create uncertainty distributions of the physics quantities. We show that physics quantity uncertainty distributions can follow a Normal distribution, but, in more complicated physics simulations, the resulting uncertainty distribution can be both nonintuitive and surprisingly nontrivial. We also establish that simply bounding the uncertainty can fail in situations that are sensitive to image segmentation. While our work does not eliminate segmentation uncertainty, it makes visible the previously unrecognized range of uncertainty currently plaguing image-based simulation, enabling more credible simulations. △ Less

Submitted 9 September, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

Journal ref: Nature Communications 12, 5414 (2021)

arXiv:2012.08919 [pdf, ps, other]

Multilingual Evidence Retrieval and Fact Verification to Combat Global Disinformation: The Power of Polyglotism

Authors: Denisa A. O. Roberts

Abstract: This article investigates multilingual evidence retrieval and fact verification as a step to combat global disinformation, a first effort of this kind, to the best of our knowledge. The goal is building multilingual systems that retrieve in evidence-rich languages to verify claims in evidence-poor languages that are more commonly targeted by disinformation. To this end, our EnmBERT fact verificati… ▽ More This article investigates multilingual evidence retrieval and fact verification as a step to combat global disinformation, a first effort of this kind, to the best of our knowledge. The goal is building multilingual systems that retrieve in evidence-rich languages to verify claims in evidence-poor languages that are more commonly targeted by disinformation. To this end, our EnmBERT fact verification system shows evidence of transfer learning ability and 400 example mixed English-Romanian dataset is made available for cross-lingual transfer learning evaluation. △ Less

Submitted 19 January, 2021; v1 submitted 16 December, 2020; originally announced December 2020.

Comments: Accepted ECIR 2021

Showing 1–50 of 82 results for author: Roberts, A