Showing 1–22 of 22 results for author: Huang, W R

Searching in archive cs.
  1. arXiv:2401.12789

    cs.CL cs.SD eess.AS

    Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

    Authors: W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath

    Abstract: In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: ICASSP 2024

  2. arXiv:2306.08133

    eess.AS cs.CL

    Large-scale Language Model Rescoring on Long-form Data

    Authors: Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno, Michael Riley

    Abstract: In this work, we study the impact of Large-scale Language Models (LLM) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8% relative reduction in Word Error Rate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets and a reduction of up to 30% relative on Salient Term Error Rate (STER)…

    Submitted 5 September, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: 5 pages, accepted at ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
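    Entry 2 reports its gains as relative reductions in word error rate (WER). A minimal Python sketch, assuming the standard definition of WER (word-level edit distance divided by the number of reference words), shows how both quantities are computed; the function names are illustrative and not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] holds the edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    # Guard against an empty reference.
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline - improved) / baseline
```

    For example, "the cat sit on mat" against the reference "the cat sat on the mat" has one substitution and one deletion, so its WER is 2/6; a baseline at 10.0% WER improved to 9.2% is an 8% relative reduction.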

  3. arXiv:2305.18419

    cs.CL cs.LG cs.SD eess.AS

    Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR

    Authors: W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-yiin Chang, Tara N. Sainath

    Abstract: We propose a method of segmenting long-form speech by separating semantically complete sentences within the utterance. This prevents the ASR decoder from needlessly processing faraway context while also preventing it from missing relevant context within the current sentence. Semantically complete sentence boundaries are typically demarcated by punctuation in written text, but unfortunately, spoken…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023. First 3 authors contributed equally

  4. arXiv:2211.15432

    cs.CL

    E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

    Authors: W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

    Abstract: We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated wi…

    Submitted 5 March, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  5. arXiv:2210.17049

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Modular Hybrid Autoregressive Transducer

    Authors: Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny Huang, Ehsan Variani, Yinghui Huang, Pedro J. Moreno

    Abstract: Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a…

    Submitted 16 February, 2023; v1 submitted 30 October, 2022; originally announced October 2022.

    Comments: 8 pages, 1 figure, in SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar
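    Entry 5 builds on the hybrid autoregressive transducer (HAT), whose output separates a blank probability from a label distribution. The sketch below assumes the usual HAT-style factorization, P(blank) = sigmoid(b) and P(label k) = (1 - P(blank)) * softmax(l)[k], where b is the blank logit and l the label logits; in MHAT these two inputs would come from the separate blank and label decoders. The function name and its arguments are hypothetical, and a real model computes the logits with neural networks rather than taking them as plain numbers.

```python
import math
from typing import List


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function (avoids overflow for large |z|).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hat_distribution(blank_logit: float, label_logits: List[float]) -> List[float]:
    """Return [P(blank), P(label_0), ..., P(label_{K-1})] under a HAT-style factorization."""
    p_blank = _sigmoid(blank_logit)
    # Stable softmax over the label logits only (blank is excluded).
    m = max(label_logits)
    exps = [math.exp(z - m) for z in label_logits]
    total = sum(exps)
    # Labels share the probability mass left over after blank.
    return [p_blank] + [(1.0 - p_blank) * e / total for e in exps]
```

    Because the label term is a proper distribution on its own, it can be read off independently of the blank probability, which is presumably what makes a structurally separated label decoder convenient for text-only adaptation.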

  6. arXiv:2204.10749

    cs.SD cs.CL cs.LG eess.AS

    E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

    Authors: W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

    Abstract: Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for…

    Submitted 15 June, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

    Comments: Interspeech 2022

  7. arXiv:2204.09606

    cs.CL cs.CR cs.LG cs.SD eess.AS

    Detecting Unintended Memorization in Language-Model-Fused ASR

    Authors: W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews

    Abstract: End-to-end (E2E) models are often accompanied by language models (LMs) via shallow fusion for boosting their overall quality as well as recognition of rare words. At the same time, several prior works show that LMs are susceptible to unintentionally memorizing rare or unique sequences in the training data. In this work, we design a framework for detecting memorization of random textual seque…

    Submitted 28 June, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

    Comments: Interspeech 2022
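    Entry 7 refers to shallow fusion, in which an end-to-end ASR score is combined log-linearly with an external LM score. The sketch below applies that combination, log P_asr(y|x) + λ log P_lm(y), to a small n-best list. The weight λ = 0.3 and the dictionary standing in for the LM are placeholders; in practice fusion is usually applied per token inside beam search rather than once per finished hypothesis.

```python
from typing import Dict, List, Tuple


def shallow_fusion_rescore(
    hypotheses: List[Tuple[str, float]],
    lm_log_prob: Dict[str, float],
    lm_weight: float = 0.3,  # illustrative fusion weight (lambda)
) -> List[Tuple[str, float]]:
    """Rank hypotheses by log P_asr(y|x) + lm_weight * log P_lm(y), best first."""
    fused = [
        (text, asr_score + lm_weight * lm_log_prob[text])
        for text, asr_score in hypotheses
    ]
    return sorted(fused, key=lambda item: item[1], reverse=True)


hyps = [("call ann", -1.0), ("call anne", -1.2)]
lm = {"call ann": -8.0, "call anne": -6.0}
ranked = shallow_fusion_rescore(hyps, lm)
```

    With these toy scores the fused values are about -3.4 for "call ann" and -3.0 for "call anne", so the LM flips the ranking even though the ASR model preferred the first hypothesis.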

  8. arXiv:2203.05008

    cs.CL cs.LG cs.SD eess.AS

    Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

    Authors: W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

    Abstract: Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather…

    Submitted 15 June, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022

  9. arXiv:2202.08171

    cs.CL cs.LG

    Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

    Authors: Hao Zhang, You-Chi Cheng, Shankar Kumar, W. Ronny Huang, Mingqing Chen, Rajiv Mathews

    Abstract: Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this n…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2108.11943

  10. arXiv:2104.14830

    cs.CL cs.SD eess.AS

    Scaling End-to-End Models for Large-Scale Multilingual ASR

    Authors: Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai

    Abstract: Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity.…

    Submitted 11 September, 2021; v1 submitted 30 April, 2021; originally announced April 2021.

    Comments: ASRU 2021

  11. arXiv:2104.04552

    cs.CL cs.SD eess.AS

    Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

    Authors: W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

    Abstract: We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table…

    Submitted 6 June, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: Presented as conference paper at Interspeech 2021
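    Entry 11 describes an additional embedding table indexed by the previous n-gram rather than a single token. A minimal sketch, assuming the n-gram is mapped to a row by hashing it into a fixed number of buckets (the paper's exact indexing scheme may differ), illustrates why the table size does not change the per-step cost: each step is a single row lookup. The class and parameter names are hypothetical, and the table here is randomly initialized rather than trained.

```python
import random
from typing import List, Sequence


class NgramLookupEmbedding:
    """Hypothetical n-gram lookup table: one row per hash bucket of the last n tokens."""

    def __init__(self, num_buckets: int, dim: int, n: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.n = n
        self.num_buckets = num_buckets
        # Randomly initialized rows; in a real model these would be trained.
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_buckets)
        ]

    def bucket(self, context: Sequence[int]) -> int:
        # Keep only the last n token ids and hash the tuple into a bucket index.
        ngram = tuple(context[-self.n:])
        return hash(ngram) % self.num_buckets

    def embed(self, context: Sequence[int]) -> List[float]:
        # One row lookup per step, independent of num_buckets.
        return self.table[self.bucket(context)]


emb = NgramLookupEmbedding(num_buckets=1000, dim=8, n=2)
```

    Contexts that share their last n tokens map to the same row, e.g. [5, 7, 9] and [1, 7, 9] with n = 2. Growing num_buckets adds parameters but no extra arithmetic per step, which matches the constant increase in floating-point operations the abstract highlights.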

  12. arXiv:2102.08098

    cs.LG cs.CL cs.CV

    GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training

    Authors: Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, W. Ronny Huang, Tom Goldstein

    Abstract: Innovations in neural architectures have fostered significant breakthroughs in language modeling and computer vision. Unfortunately, novel architectures often result in challenging hyper-parameter choices and training instability if the network parameters are not properly initialized. A number of architecture-specific initialization schemes have been proposed, but these schemes are not always port…

    Submitted 24 November, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

    Comments: NeurIPS 2021, fixing typos

  13. arXiv:2009.02276

    cs.CV cs.LG

    Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching

    Authors: Jonas Geiping, Liam Fowl, W. Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, Tom Goldstein

    Abstract: Data Poisoning attacks modify training data to maliciously control a model trained on such data. In this work, we focus on targeted poisoning attacks which cause a reclassification of an unmodified test image and as such breach model integrity. We consider a particularly malicious poisoning attack that is both "from scratch" and "clean label", meaning we analyze an attack that successfully works a…

    Submitted 10 May, 2021; v1 submitted 4 September, 2020; originally announced September 2020.

    Comments: First two authors contributed equally. Last two authors contributed equally. 21 pages, 11 figures. Published at ICLR 2021

  14. arXiv:2004.00225

    cs.LG cs.AI cs.CR cs.CV stat.ML

    MetaPoison: Practical General-purpose Clean-label Data Poisoning

    Authors: W. Ronny Huang, Jonas Geiping, Liam Fowl, Gavin Taylor, Tom Goldstein

    Abstract: Data poisoning -- the process by which an attacker takes control of a model by making imperceptible changes to a subset of the training data -- is an emerging threat in the context of neural networks. Existing attacks for data poisoning neural networks have relied on hand-crafted heuristics, because solving the poisoning problem directly via bilevel optimization is generally thought of as intracta…

    Submitted 20 February, 2021; v1 submitted 1 April, 2020; originally announced April 2020.

    Comments: Conference paper at NeurIPS 2020. First two authors contributed equally

  15. arXiv:1910.07070

    cs.CV cs.LG cs.NE

    DeepErase: Weakly Supervised Ink Artifact Removal in Document Text Images

    Authors: W. Ronny Huang, Yike Qi, Qianqian Li, Jonathan Degange

    Abstract: Paper-intensive industries like insurance, law, and government have long leveraged optical character recognition (OCR) to automatically transcribe hordes of scanned documents into text strings for downstream processing. Even in 2019, there are still many scanned documents and mail that come into businesses in non-digital format. Text to be extracted from real world documents is often nestled insid…

    Submitted 16 January, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

    Comments: Conference paper at WACV 2020. First two authors contributed equally

  16. arXiv:1909.13374

    cs.LG cs.CV cs.NE

    Deep k-NN Defense against Clean-label Data Poisoning Attacks

    Authors: Neehar Peri, Neal Gupta, W. Ronny Huang, Liam Fowl, Chen Zhu, Soheil Feizi, Tom Goldstein, John P. Dickerson

    Abstract: Targeted clean-label data poisoning is a type of adversarial attack on machine learning systems in which an adversary injects a few correctly-labeled, minimally-perturbed samples into the training data, causing a model to misclassify a particular test sample during inference. Although defenses have been proposed for general poisoning attacks, no reliable defense for clean-label attacks has been de…

    Submitted 13 August, 2020; v1 submitted 29 September, 2019; originally announced September 2019.

    Comments: Accepted to ECCV 2020 Workshop - Adversarial Robustness in the Real World (AROW). First three authors contributed equally
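    Entry 16 proposes a k-nearest-neighbor defense. The sketch below assumes the filtering rule is: drop a training point whose label disagrees with the plurality label of its k nearest neighbors in feature space. Features are passed in directly here (in the defense they would come from a trained network's feature layer), vote ties are broken arbitrarily by Counter.most_common, and the brute-force distance search is for illustration only.

```python
from collections import Counter
from typing import List, Sequence


def deep_knn_filter(
    features: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> List[int]:
    """Return indices of training points kept by a k-NN plurality-vote filter."""
    kept = []
    for i, (x_i, y_i) in enumerate(zip(features, labels)):
        # Squared Euclidean distance from point i to every other training point.
        dists = [
            (sum((a - b) ** 2 for a, b in zip(x_i, x_j)), j)
            for j, x_j in enumerate(features)
            if j != i
        ]
        neighbors = [j for _, j in sorted(dists)[:k]]
        votes = Counter(labels[j] for j in neighbors)
        plurality_label, _ = votes.most_common(1)[0]
        # Keep the point only if its own label agrees with its neighborhood.
        if plurality_label == y_i:
            kept.append(i)
    return kept


# Two clean clusters plus one point labeled 1 that sits inside the class-0 cluster.
feats = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (0.5, 0.5)]
labs = [0, 0, 0, 1, 1, 1, 1]
kept = deep_knn_filter(feats, labs, k=3)
```

    On this toy data the filter keeps indices 0 through 5 and removes the mislabeled point at index 6, whose three nearest neighbors all carry label 0.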

  17. arXiv:1906.03291

    cs.LG cs.NE stat.ML

    Understanding Generalization through Visualizations

    Authors: W. Ronny Huang, Zeyad Emam, Micah Goldblum, Liam Fowl, J. K. Terry, Furong Huang, Tom Goldstein

    Abstract: The power of neural networks lies in their ability to generalize to unseen data, yet the underlying reasons for this phenomenon remain elusive. Numerous rigorous attempts have been made to explain generalization, but available bounds are still quite loose, and analysis does not always lead to true understanding. The goal of this work is to make generalization more intuitive. Using visualization me…

    Submitted 14 November, 2020; v1 submitted 7 June, 2019; originally announced June 2019.

    Comments: 8 pages (excluding acknowledgments and references), 8 figures

  18. arXiv:1905.05897

    stat.ML cs.CR cs.LG

    Transferable Clean-Label Poisoning Attacks on Deep Neural Nets

    Authors: Chen Zhu, W. Ronny Huang, Ali Shafahi, Hengduo Li, Gavin Taylor, Christoph Studer, Tom Goldstein

    Abstract: Clean-label poisoning attacks inject innocuous looking (and "correctly" labeled) poison images into training data, causing a model to misclassify a targeted image after being trained on this data. We consider transferable poisoning attacks that succeed without access to the victim network's outputs, architecture, or (in some cases) training data. To achieve this, we propose a new "polytope attack"…

    Submitted 16 May, 2019; v1 submitted 14 May, 2019; originally announced May 2019.

    Comments: Accepted to ICML 2019

  19. arXiv:1904.06963

    cs.LG stat.ML

    The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent

    Authors: Karthik A. Sankararaman, Soham De, Zheng Xu, W. Ronny Huang, Tom Goldstein

    Abstract: This paper studies how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When gradient confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training p…

    Submitted 6 July, 2020; v1 submitted 15 April, 2019; originally announced April 2019.

    Comments: ICML 2020 camera-ready version
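    Entry 19 introduces gradient confusion. Assuming it is defined as a bound η ≥ 0 such that any two per-sample gradients satisfy <∇f_i(w), ∇f_j(w)> ≥ -η for all i ≠ j, the smallest such η for a given set of gradients can be computed directly. The function name is hypothetical, and the gradients are supplied as plain vectors rather than computed from a network.

```python
from itertools import combinations
from typing import Sequence


def gradient_confusion(per_sample_grads: Sequence[Sequence[float]]) -> float:
    """Smallest eta >= 0 such that <g_i, g_j> >= -eta for every pair i != j."""
    worst = 0.0
    for g_i, g_j in combinations(per_sample_grads, 2):
        inner = sum(a * b for a, b in zip(g_i, g_j))
        # A negative inner product means the two samples pull in opposing directions.
        worst = max(worst, -inner)
    return worst
```

    Identical gradients such as [1, 0] and [1, 0] give 0.0 (no confusion), while [1, 0] and [-1, 0] give 1.0: the samples point in opposite directions, which is the negatively correlated regime the abstract associates with slower convergence.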

  20. arXiv:1811.10791

    cs.LG stat.ML

    Accurate, Data-Efficient Learning from Noisy, Choice-Based Labels for Inherent Risk Scoring

    Authors: W. Ronny Huang, Miguel A. Perez

    Abstract: Inherent risk scoring is an important function in anti-money laundering, used for determining the riskiness of an individual during onboarding before fraudulent transactions occur. It is, however, often fraught with two challenges: (1) inconsistent notions of what constitutes high or low risk by experts and (2) the lack of labeled data. This paper explores a new paradigm of data labe…

    Submitted 26 November, 2018; originally announced November 2018.

    Comments: Presented as an oral at the NIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services: the Impact of Fairness, Explainability, Accuracy, and Privacy (FEAP-AI4Fin 2018). 9 pages, 4 figures

  21. arXiv:1809.02104

    cs.LG cs.CV stat.ML

    Are adversarial examples inevitable?

    Authors: Ali Shafahi, W. Ronny Huang, Christoph Studer, Soheil Feizi, Tom Goldstein

    Abstract: A wide range of defenses have been proposed to harden neural networks against adversarial attacks. However, a pattern has emerged in which the majority of adversarial defenses are quickly broken by new attacks. Given the lack of success at generating robust defenses, we are led to ask a fundamental question: Are adversarial attacks inevitable? This paper analyzes adversarial examples from a theore…

    Submitted 3 February, 2020; v1 submitted 6 September, 2018; originally announced September 2018.

    Journal ref: International Conference on Learning Representations, 2019. https://openreview.net/forum?id=r1lWUoA9FQ

  22. arXiv:1804.00792

    cs.LG cs.CR cs.CV stat.ML

    Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks

    Authors: Ali Shafahi, W. Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, Tom Goldstein

    Abstract: Data poisoning is an attack on machine learning models wherein the attacker adds examples to the training set to manipulate the behavior of the model at test time. This paper explores poisoning attacks on neural nets. The proposed attacks use "clean-labels"; they don't require the attacker to have any control over the labeling of training data. They are also targeted; they control the behavior of…

    Submitted 10 November, 2018; v1 submitted 2 April, 2018; originally announced April 2018.

    Comments: Presented at the NIPS 2018 conference. 11 pages, 4 figures, with a supplementary section of 7 pages, 7 figures. First two authors contributed equally