-
Anatomy of Industrial Scale Multilingual ASR
Authors:
Francis McCann Ramirez,
Luka Chkhetiani,
Andrew Ehrenberg,
Robert McHardy,
Rami Botros,
Yash Khare,
Andrea Vanzo,
Taufiquzzaman Peyash,
Gabriel Oexle,
Michael Liang,
Ilya Sklyar,
Enver Fakhan,
Ahmed Etefy,
Daniel McCrystal,
Sam Flamini,
Domenic Donato,
Takuya Yoshioka
Abstract:
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed descriptio…
▽ More
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
△ Less
Submitted 16 April, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping
Authors:
Kevin Zhang,
Luka Chkhetiani,
Francis McCann Ramirez,
Yash Khare,
Andrea Vanzo,
Michael Liang,
Sergio Ramirez Martin,
Gabriel Oexle,
Ruben Bousbib,
Taufiquzzaman Peyash,
Michael Nguyen,
Dillon Pulliam,
Domenic Donato
Abstract:
This paper presents Conformer-1, an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data, 91% of which was acquired from publicly available sources. To achieve this, we perform Noisy Student Training after generating pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model. The addition of these pseu…
▽ More
This paper presents Conformer-1, an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data, 91% of which was acquired from publicly available sources. To achieve this, we perform Noisy Student Training after generating pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model. The addition of these pseudo-labeled data results in remarkable improvements in relative Word Error Rate (WER) by 11.5% and 24.3% for our asynchronous and realtime models, respectively. Additionally, the model is more robust to background noise owing to the addition of these data. The results obtained in this study demonstrate that the incorporation of pseudo-labeled publicly available data is a highly effective strategy for improving ASR accuracy and noise robustness.
△ Less
Submitted 12 April, 2024; v1 submitted 10 April, 2024;
originally announced April 2024.
-
MAD for Robust Reinforcement Learning in Machine Translation
Authors:
Domenic Donato,
Lei Yu,
Wang Ling,
Chris Dyer
Abstract:
We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviatio…
▽ More
We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting calculation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends crucially on two variance reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive and negative reward translation examples and (2) a new robust importance weighting scheme that acts as a conditional entropy regularizer. Experiments on a variety of translation tasks show that policies learned using the MAD algorithm perform very well when using both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.
△ Less
Submitted 18 July, 2022;
originally announced July 2022.
-
Enabling arbitrary translation objectives with Adaptive Tree Search
Authors:
Wang Ling,
Wojciech Stokowiec,
Domenic Donato,
Laurent Sartran,
Lei Yu,
Austin Matthews,
Chris Dyer
Abstract:
We introduce an adaptive tree search algorithm, that can find high-scoring outputs under translation models that make no assumptions about the form or structure of the search objective. This algorithm -- a deterministic variant of Monte Carlo tree search -- enables the exploration of new kinds of models that are unencumbered by constraints imposed to make decoding tractable, such as autoregressivi…
▽ More
We introduce an adaptive tree search algorithm, that can find high-scoring outputs under translation models that make no assumptions about the form or structure of the search objective. This algorithm -- a deterministic variant of Monte Carlo tree search -- enables the exploration of new kinds of models that are unencumbered by constraints imposed to make decoding tractable, such as autoregressivity or conditional independence assumptions. When applied to autoregressive models, our algorithm has different biases than beam search has, which enables a new analysis of the role of decoding bias in autoregressive models. Empirically, we show that our adaptive tree search algorithm finds outputs with substantially better model scores compared to beam search in autoregressive models, and compared to reranking techniques in models whose scores do not decompose additively with respect to the words in the output. We also characterise the correlation of several translation model objectives with respect to BLEU. We find that while some standard models are poorly calibrated and benefit from the beam search bias, other often more robust models (autoregressive models tuned to maximize expected automatic metric scores, the noisy channel model and a newly proposed objective) benefit from increasing amounts of search using our proposed decoder, whereas the beam search bias limits the improvements obtained from such objectives. Thus, we argue that as models improve, the improvements may be masked by over-reliance on beam search or reranking based methods.
△ Less
Submitted 23 February, 2022;
originally announced February 2022.
-
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Authors:
Jack W. Rae,
Sebastian Borgeaud,
Trevor Cai,
Katie Millican,
Jordan Hoffmann,
Francis Song,
John Aslanides,
Sarah Henderson,
Roman Ring,
Susannah Young,
Eliza Rutherford,
Tom Hennigan,
Jacob Menick,
Albin Cassirer,
Richard Powell,
George van den Driessche,
Lisa Anne Hendricks,
Maribeth Rauh,
Po-Sen Huang,
Amelia Glaese,
Johannes Welbl,
Sumanth Dathathri,
Saffron Huang,
Jonathan Uesato,
John Mellor
, et al. (55 additional authors not shown)
Abstract:
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gop…
▽ More
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.
△ Less
Submitted 21 January, 2022; v1 submitted 8 December, 2021;
originally announced December 2021.
-
Diverse Pretrained Context Encodings Improve Document Translation
Authors:
Domenic Donato,
Lei Yu,
Chris Dyer
Abstract:
We propose a new architecture for adapting a sentence-level sequence-to-sequence transformer by incorporating multiple pretrained document context signals and assess the impact on translation performance of (1) different pretraining approaches for generating these signals, (2) the quantity of parallel data for which document context is available, and (3) conditioning on source, target, or source a…
▽ More
We propose a new architecture for adapting a sentence-level sequence-to-sequence transformer by incorporating multiple pretrained document context signals and assess the impact on translation performance of (1) different pretraining approaches for generating these signals, (2) the quantity of parallel data for which document context is available, and (3) conditioning on source, target, or source and target contexts. Experiments on the NIST Chinese-English, and IWSLT and WMT English-German tasks support four general conclusions: that using pretrained context representations markedly improves sample efficiency, that adequate parallel data resources are crucial for learning to use document context, that jointly conditioning on multiple context representations outperforms any single representation, and that source context is more valuable for translation performance than target side context. Our best multi-context model consistently outperforms the best existing context-aware transformers.
△ Less
Submitted 7 June, 2021;
originally announced June 2021.
-
Schroedinger-like PageRank equation and localization in the WWW
Authors:
Nicola Perra,
Vinko Zlatic,
Alessandro Chessa,
Claudio Conti,
Debora Donato,
Guido Caldarelli
Abstract:
The WorldWide Web is one of the most important communication systems we use in our everyday life. Despite its central role, the growth and the development of the WWW is not controlled by any central authority. This situation has created a huge ensemble of connections whose complexity can be fruitfully described and quantified by network theory. One important application that allows to sort out t…
▽ More
The WorldWide Web is one of the most important communication systems we use in our everyday life. Despite its central role, the growth and the development of the WWW is not controlled by any central authority. This situation has created a huge ensemble of connections whose complexity can be fruitfully described and quantified by network theory. One important application that allows to sort out the information present in these connections is given by the PageRank alghorithm. Computation of this quantity is usually made iteratively with a large use of computational time. In this paper we show that the PageRank can be expressed in terms of a wave function obeying a Schroedinger-like equation. In particular the topological disorder given by the unbalance of outgoing and ingoing links between pages, induces wave function and potential structuring. This allows to directly localize the pages with the largest score. Through this new representation we can now compute the PageRank without iterative techniques. For most of the cases of interest our method is faster than the original one. Our results also clarify the role of topology in the diffusion of information within complex networks. The whole approach opens the possibility to novel techniques inspired by quantum physics for the analysis of the WWW properties.
△ Less
Submitted 28 July, 2008;
originally announced July 2008.
-
Preferential attachment in the growth of social networks: the case of Wikipedia
Authors:
A. Capocci,
V. D. P. Servedio,
F. Colaiori,
L. S. Buriol,
D. Donato,
S. Leonardi,
G. Caldarelli
Abstract:
We present an analysis of the statistical properties and growth of the free on-line encyclopedia Wikipedia. By describing topics by vertices and hyperlinks between them as edges, we can represent this encyclopedia as a directed graph. The topological properties of this graph are in close analogy with that of the World Wide Web, despite the very different growth mechanism. In particular we measur…
▽ More
We present an analysis of the statistical properties and growth of the free on-line encyclopedia Wikipedia. By describing topics by vertices and hyperlinks between them as edges, we can represent this encyclopedia as a directed graph. The topological properties of this graph are in close analogy with that of the World Wide Web, despite the very different growth mechanism. In particular we measure a scale--invariant distribution of the in-- and out-- degree and we are able to reproduce these features by means of a simple statistical model. As a major consequence, Wikipedia growth can be described by local rules such as the preferential attachment mechanism, though users can act globally on the network.
△ Less
Submitted 27 February, 2006; v1 submitted 3 February, 2006;
originally announced February 2006.