The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Authors: Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao

Abstract: Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.

Submitted 27 August, 2024; originally announced August 2024.

Comments: Code is open-sourced at https://github.com/jxiw/MambaInLlama

arXiv:2406.02532 [pdf, other]

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

Authors: Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin

Abstract: As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When running with offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens at the same time as just one token, making it a natural fit for speculative decoding. We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families. It utilizes the high spikiness of the token probability distribution in modern LLMs and a high degree of alignment between model output probabilities. SpecExec takes the most probable token continuations from the draft model to build a "cache" tree for the target model, which then gets validated in a single pass. Using SpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights.

Submitted 25 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: preprint

arXiv:2404.03039 [pdf, ps, other]

Illustrating Finite Automata with Grail+ and TikZ

Authors: Alastair May, Taylor J. Smith

Abstract: In this article, we discuss a new software tool that interacts with Grail+, a library of automata-theoretic command-line utilities. Our software, the Grail+ Visualizer, takes the textual representation of a finite automaton produced by Grail+ and generates TikZ code to illustrate the finite automaton, with automatic layout of states and transitions. In addition to giving an overview of the basics of automata theory and Grail+, we discuss how the Grail+ Visualizer works in detail and suggest avenues for future work.

Submitted 3 April, 2024; originally announced April 2024.

MSC Class: 68-04 (primary); 68Q45 (secondary)

arXiv:2402.12374 [pdf, other]

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Authors: Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

Abstract: As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets, and adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by automatically selecting the token tree size and depth for a given hardware platform. Evaluation shows that Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 by up to $4.04\times$, $3.73\times$, and $2.27\times$. In the offloading setting on an L40, Sequoia achieves as low as 0.56 s/token for exact Llama2-70B inference latency, which is $9.96\times$ faster than our optimized offloading system (5.6 s/token), $9.7\times$ faster than DeepSpeed-Zero-Inference, and $19.5\times$ faster than Huggingface Accelerate.

Submitted 29 February, 2024; v1 submitted 19 February, 2024; originally announced February 2024.
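
Note: the Sequoia and SpecExec abstracts above both build on speculative decoding. The following is a minimal sketch of the basic draft-then-verify loop with greedy decoding, not the tree-based algorithms of either paper. The toy models and all function names (`toy_target`, `toy_draft`, `speculative_step`, `generate`) are hypothetical and chosen only for illustration. The sketch shows why the output stays exact: every emitted token is one the target model would have chosen itself.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (3 * sum(ctx) + len(ctx)) % 11


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small draft model.

    It agrees with the target except when the prefix length is a multiple of 4.
    """
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 11


def greedy(prefix: List[int], model: Model, n_tokens: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(model(out))
    return out


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Draft k tokens, then keep the longest run the target agrees with."""
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # A real system scores all k positions in one batched target pass;
    # this sketch calls the target per position for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # all k accepted: the target pass yields one extra token
    return accepted


def generate(prefix: List[int], draft: Model, target: Model, k: int, n_tokens: int) -> List[int]:
    """Repeat speculative steps until n_tokens new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_tokens]
```

Each step yields between 1 and k + 1 tokens per target pass, so the gain depends on how often the draft agrees with the target.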

arXiv:2312.09369 [pdf, other]

Audio-visual fine-tuning of audio-only ASR models

Authors: Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan

Abstract: Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches have been developed to reduce this dependence on transcribed AV data, but these methods are quite complex and computationally expensive. In this work, we propose replacing these expensive AV-SSL methods with a simple and fast audio-only SSL method, and then performing AV supervised fine-tuning. We show that this approach is competitive with state-of-the-art (SOTA) AV-SSL methods on the LRS3-TED benchmark task (within 0.5% absolute WER), while being dramatically simpler and more efficient (12-30x faster to pre-train). Furthermore, we show we can extend this approach to convert a SOTA audio-only ASR model into an AV model. By doing so, we match SOTA AV-SSL results, even though no AV data was used during pre-training.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2212.09616 [pdf, other]

doi 10.1109/BigData55660.2022.10020380

Pseudonymization at Scale: OLCF's Summit Usage Data Case Study

Authors: Ketan Maheshwari, Sean R. Wilkinson, Alex May, Tyler Skluzacek, Olga A. Kuchar, Rafael Ferreira da Silva

Abstract: The analysis of vast amounts of data and the processing of complex computational jobs have traditionally relied upon high performance computing (HPC) systems. Understanding these analyses' needs is paramount for designing solutions that can lead to better science, and similarly, understanding the characteristics of the user behavior on those systems is important for improving user experiences on HPC systems. A common approach to gathering data about user behavior is to analyze system log data available only to system administrators. Recently at Oak Ridge Leadership Computing Facility (OLCF), however, we unveiled user behavior about the Summit supercomputer by collecting data from a user's point of view with ordinary Unix commands. Here, we discuss the process, challenges, and lessons learned while preparing this dataset for publication and submission to an open data challenge. The original dataset contains personally identifiable information (PII) about OLCF users which needed to be masked prior to publication, and we determined that anonymization, which scrubs PII completely, destroyed too much of the structure of the data to be interesting for the data challenge. We instead chose to pseudonymize the dataset to reduce its linkability to users' identities. Pseudonymization is significantly more computationally expensive than anonymization, and the size of our dataset, approximately 175 million lines of raw text, necessitated the development of a parallelized workflow that could be reused on different HPC machines. We demonstrate the scaling behavior of the workflow on two leadership-class HPC systems at OLCF, and we show that we were able to bring the overall makespan time from an impractical 20+ hours on a single node down to around 2 hours. As a result of this work, we release the entire pseudonymized dataset and make the workflows and source code publicly available.

Submitted 19 December, 2022; originally announced December 2022.

Comments: 9 pages, 5 figures, accepted to BTSD 2022 workshop (see https://sites.google.com/view/btsd2022 for more information), to be published in the proceedings of IEEE Big Data 2022

arXiv:2112.04838 [pdf, other]

doi 10.1109/SP46214.2022.9833605

How Not to Protect Your IP -- An Industry-Wide Break of IEEE 1735 Implementations

Authors: Julian Speith, Florian Schweins, Maik Ender, Marc Fyrbiak, Alexander May, Christof Paar

Abstract: Modern hardware systems are composed of a variety of third-party Intellectual Property (IP) cores to implement their overall functionality. Since hardware design is a globalized process involving various (untrusted) stakeholders, a secure management of the valuable IP between authors and users is inevitable to protect them from unauthorized access and modification. To this end, the widely adopted IEEE standard 1735-2014 was created to ensure confidentiality and integrity. In this paper, we outline structural weaknesses in IEEE 1735 that cannot be fixed with cryptographic solutions (given the contemporary hardware design process) and thus render the standard inherently insecure. We practically demonstrate the weaknesses by recovering the private keys of IEEE 1735 implementations from major Electronic Design Automation (EDA) tool vendors, namely Intel, Xilinx, Cadence, Siemens, Microsemi, and Lattice, while results on a seventh case study are withheld. As a consequence, we can decrypt, modify, and re-encrypt all allegedly protected IP cores designed for the respective tools, thus leading to an industry-wide break. As part of this analysis, we are the first to publicly disclose three RSA-based white-box schemes that are used in real-world products and present cryptanalytical attacks for all of them, finally resulting in key recovery.

Submitted 9 December, 2021; originally announced December 2021.

arXiv:2005.09117 [pdf, other]

Contextual Embeddings: When Are They Worth It?

Authors: Simran Arora, Avner May, Jian Zhang, Christopher Ré

Abstract: We study the settings for which deep contextual embeddings (e.g., BERT) give large improvements in performance relative to classic pretrained embeddings (e.g., GloVe), and an even simpler baseline---random word embeddings---focusing on the impact of the training set size and the linguistic properties of the task. Surprisingly, we find that both of these simpler baselines can match contextual embeddings on industry-scale data, and often perform within 5 to 10% accuracy (absolute) on benchmark tasks. Furthermore, we identify properties of data for which contextual embeddings give particularly large gains: language containing complex structure, ambiguous word usage, and words unseen in training.

Submitted 18 May, 2020; originally announced May 2020.

Comments: ACL 2020

arXiv:2003.04983 [pdf, other]

Understanding the Downstream Instability of Word Embeddings

Authors: Megan Leszczynski, Avner May, Jian Zhang, Sen Wu, Christopher R. Aberger, Christopher Ré

Abstract: Many industrial machine learning (ML) systems require frequent retraining to keep up-to-date with constantly changing data. This retraining exacerbates a large challenge facing ML systems today: model training is unstable, i.e., small changes in training data can cause significant changes in the model's predictions. In this paper, we work on developing a deeper understanding of this instability, with a focus on how a core building block of modern natural language processing (NLP) pipelines---pre-trained word embeddings---affects the instability of downstream NLP models. We first empirically reveal a tradeoff between stability and memory: increasing the embedding memory 2x can reduce the disagreement in predictions due to small changes in training data by 5% to 37% (relative). To theoretically explain this tradeoff, we introduce a new measure of embedding instability---the eigenspace instability measure---which we prove bounds the disagreement in downstream predictions introduced by the change in word embeddings. Practically, we show that the eigenspace instability measure can be a cost-effective way to choose embedding parameters to minimize instability without training downstream models, outperforming other embedding distance measures and performing competitively with a nearest neighbor-based measure. Finally, we demonstrate that the observed stability-memory tradeoffs extend to other types of embeddings as well, including knowledge graph and contextual word embeddings.

Submitted 28 February, 2020; originally announced March 2020.

Comments: In Proceedings of the 3rd MLSys Conference, 2020

arXiv:1910.00802 [pdf, other]

Noisy Simon Period Finding

Authors: Alexander May, Lars Schlieper, Jonathan Schwinger

Abstract: Let $f: \mathbb{F}_2^n \rightarrow \mathbb{F}_2^n$ be a Boolean function with period $\vec s$. It is well-known that Simon's algorithm finds $\vec s$ in time polynomial in $n$ on quantum devices that are capable of performing error-correction. However, today's quantum devices are inherently noisy, too limited for error correction, and Simon's algorithm is not error-tolerant. We show that even noisy quantum period finding computations may lead to speedups in comparison to purely classical computations. To this end, we implemented Simon's quantum period finding circuit on the $15$-qubit quantum device IBM Q 16 Melbourne. Our experiments show that with a certain probability $\tau(n)$ we measure erroneous vectors that are not orthogonal to $\vec s$. We propose new, simple, but very effective smoothing techniques to classically mitigate physical noise effects such as IBM Q's bias towards the $0$-qubit. After smoothing, our noisy quantum device provides us with a statistical distribution that we can easily transform into an LPN instance with parameters $n$ and $\tau(n)$. Hence, in the noisy case we may not hope to find periods in time polynomial in $n$. However, we may still obtain a quantum advantage if the error $\tau(n)$ does not grow too large. This demonstrates that quantum devices may be useful for period finding, even before achieving the level of full error correction capability.

Submitted 10 March, 2021; v1 submitted 2 October, 2019; originally announced October 2019.
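
Note: for the noise-free case mentioned in the abstract above, Simon's algorithm ends with a classical step: every measured vector $y$ satisfies $y \cdot \vec s = 0 \pmod 2$, and $\vec s$ is recovered by linear algebra over $\mathbb{F}_2$. The following is a minimal sketch of that step only, assuming vectors are encoded as Python integers (bit $i$ is coordinate $i$). The function name `solve_period` is hypothetical. It does not handle the noisy case, which the abstract reduces to an LPN instance.

```python
from typing import Dict, Iterable, Optional


def solve_period(ys: Iterable[int], n: int) -> Optional[int]:
    """Return the nonzero s with y.s = 0 (mod 2) for every y in ys.

    Returns None when the vectors do not yet span an (n-1)-dimensional
    space, i.e. when more measurements would be needed.
    """
    # pivots maps a pivot bit position to its row. The rows are kept in
    # reduced form: each pivot bit is set in exactly one row.
    pivots: Dict[int, int] = {}
    for y in ys:
        # Clear every existing pivot bit from the new vector.
        for bit, row in pivots.items():
            if (y >> bit) & 1:
                y ^= row
        if y == 0:
            continue  # linearly dependent on what we already have
        new_bit = y.bit_length() - 1
        # Clear the new pivot bit from all existing rows.
        for bit in pivots:
            if (pivots[bit] >> new_bit) & 1:
                pivots[bit] ^= y
        pivots[new_bit] = y

    if len(pivots) != n - 1:
        return None

    # Exactly one coordinate is free; set it to 1 and read off the rest.
    free = next(b for b in range(n) if b not in pivots)
    s = 1 << free
    for bit, row in pivots.items():
        if (row >> free) & 1:
            s |= 1 << bit
    return s
```

With rank $n-1$ the null space has exactly one nonzero element, so the returned value is the period.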

arXiv:1909.01264 [pdf, other]

On the Downstream Performance of Compressed Word Embeddings

Authors: Avner May, Jian Zhang, Tri Dao, Christopher Ré

Abstract: Compressing word embeddings is important for deploying NLP models in memory-constrained settings. However, understanding what makes compressed embeddings perform well on downstream tasks is challenging---existing measures of compression quality often fail to distinguish between embeddings that perform well and those that do not. We thus propose the eigenspace overlap score as a new measure. We relate the eigenspace overlap score to downstream performance by developing generalization bounds for the compressed embeddings in terms of this score, in the context of linear and logistic regression. We then show that we can lower bound the eigenspace overlap score for a simple uniform quantization compression method, helping to explain the strong empirical performance of this method. Finally, we show that by using the eigenspace overlap score as a selection criterion between embeddings drawn from a representative set we compressed, we can efficiently identify the better performing embedding with up to $2\times$ lower selection error rates than the next best measure of compression quality, and avoid the cost of training a model for each task of interest.

Submitted 14 January, 2020; v1 submitted 3 September, 2019; originally announced September 2019.

Comments: NeurIPS 2019 spotlight (Conference on Neural Information Processing Systems)
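
Note: the abstract above refers to a simple uniform quantization compression method. The following is a minimal sketch of one common form of uniform quantization: clip each value to a fixed range, then round to the nearest of $2^b$ evenly spaced levels. The clipping range, the deterministic rounding, and the function name `uniform_quantize` are assumptions made for illustration; the abstract does not specify these details.

```python
from typing import List


def uniform_quantize(values: List[float], bits: int, clip: float) -> List[float]:
    """Quantize values to 2**bits evenly spaced levels in [-clip, clip].

    Assumption: deterministic round-to-nearest. Other schemes, such as
    stochastic rounding, fit the same interface.
    """
    intervals = (1 << bits) - 1      # number of gaps between adjacent levels
    step = 2.0 * clip / intervals    # spacing between adjacent levels
    quantized = []
    for v in values:
        v = max(-clip, min(clip, v))            # clip outliers to the range
        index = round((v + clip) / step)        # nearest level index, 0..intervals
        quantized.append(-clip + index * step)  # map the index back to a value
    return quantized
```

Each stored entry then needs only `bits` bits (the level index) instead of a 32-bit float, which is where the memory saving comes from.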

arXiv:1907.04295

Better Sample -- Random Subset Sum in $2^{0.255n}$ and its Impact on Decoding Random Linear Codes

Authors: Andre Esser, Alexander May

Abstract: We propose a new heuristic algorithm for solving random subset sum instances $a_1, \ldots, a_n, t \in \mathbb{Z}_{2^n}$, which play a crucial role in cryptographic constructions. Our algorithm is search tree-based and solves the instances in a divide-and-conquer method using the representation method. From a high-level perspective, our algorithm is similar to the algorithm of Howgrave-Graham-Joux (HGJ) and Becker-Coron-Joux (BCJ), but instead of enumerating the initial lists we sample candidate solutions. So whereas HGJ and BCJ are based on combinatorics, our analysis is stochastic. Our sampling technique introduces variance that increases the number of representations and gives our algorithm more optimization flexibility. This results in the remarkable and natural property that we improve with increasing search tree depth. Whereas BCJ achieves the currently best known (heuristic) run time $2^{0.291n}$ for random subset sum, we improve (heuristically) down to $2^{0.255n}$ using a search tree of depth at least $13$. We also apply our subset algorithm to the decoding of random binary linear codes, where we improve the best known run time of the Becker-Joux-May-Meurer algorithm from $2^{0.048n}$ in the half distance decoding setting down to $2^{0.042n}$.

Submitted 21 October, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

Comments: Issue with counting duplicate representations

arXiv:1905.10074 [pdf, other]

Quantum Period Finding is Compression Robust

Authors: Alexander May, Lars Schlieper

Abstract: We study quantum period finding algorithms such as Simon and Shor (and its variants Ekerå-Håstad and Mosca-Ekert). For a periodic function $f$ these algorithms produce -- via some quantum embedding of $f$ -- a quantum superposition $\sum_x |x\rangle|f(x)\rangle$, which requires a certain number of output qubits that represent $|f(x)\rangle$. We show that one can lower this number to a single output qubit by hashing $f$ down to a single bit in an oracle setting. Namely, we replace the embedding of $f$ in quantum period finding circuits by oracle access to several embeddings of hashed versions of $f$. We show that in expectation this modification only doubles the required number of quantum measurements, while significantly reducing the total number of qubits. For example, for Simon's algorithm that finds periods in $f: \mathbb{F}_2^n \rightarrow \mathbb{F}_2^n$ our hashing technique reduces the required output qubits from $n$ down to $1$, and therefore the total number of qubits from $2n$ to $n+1$. We also show that Simon's algorithm admits real world applications with only $n+1$ qubits by giving a concrete realization of a hashed version of the cryptographic Even-Mansour construction. Moreover, for a variant of Simon's algorithm on Even-Mansour that requires only classical queries to Even-Mansour we save a factor of (roughly) $4$ in the qubits. Our oracle-based hashed version of the Ekerå-Håstad algorithm for factoring $n$-bit RSA reduces the required qubits from $(\frac 3 2 + o(1))n$ down to $(\frac 1 2 + o(1))n$. We also show a real-world (non-oracle) application in the discrete logarithm setting by giving a concrete realization of a hashed version of Mosca-Ekert for the Decisional Diffie-Hellman problem in $\mathbb{F}_{p^m}$, thereby reducing the number of qubits by even a linear factor from $m \log p$ down to $\log p$.

Submitted 15 February, 2021; v1 submitted 24 May, 2019; originally announced May 2019.

arXiv:1811.11539 [pdf, other]

doi 10.1007/s10664-019-09685-x

Gender Differences in Participation and Reward on Stack Overflow

Authors: Anna May, Johannes Wachs, Aniko Hannak

Abstract: Programming is a valuable skill in the labor market, making the underrepresentation of women in computing an increasingly important issue. Online question and answer platforms serve a dual purpose in this field: they form a body of knowledge useful as a reference and learning tool, and they provide opportunities for individuals to demonstrate credible, verifiable expertise. Issues such as male-oriented site design or overrepresentation of men among the site's elite may therefore compound the issue of women's underrepresentation in IT. In this paper we audit the differences in behavior and outcomes between men and women on Stack Overflow, the most popular of these Q&A sites. We observe significant differences in how men and women participate in the platform and how successful they are. For example, the average woman has roughly half of the reputation points, the primary measure of success on the site, of the average man. Using an Oaxaca-Blinder decomposition, an econometric technique commonly applied to analyze differences in wages between groups, we find that most of the gap in success between men and women can be explained by differences in their activity on the site and differences in how these activities are rewarded. Specifically, 1) men give more answers than women and 2) are rewarded more for their answers on average, even when controlling for possible confounders such as tenure or buy-in to the site. Women ask more questions and gain more reward per question. We conclude with a hypothetical redesign of the site's scoring system based on these behavioral differences, cutting the reputation gap in half.

Submitted 28 November, 2018; originally announced November 2018.

Journal ref: Empirical Software Engineering 2019
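
Note: the Oaxaca-Blinder decomposition named in the abstract above can be written, in its standard twofold form, as follows. This is the textbook identity, stated under the assumption of separate linear regressions with intercepts for each group; the symbols ($\bar{Y}$ for mean outcome, $\bar{X}$ for mean covariates, $\hat\beta$ for fitted coefficients, subscripts $M$ and $W$ for men and women) and the choice of reference coefficients are illustrative and may differ from the paper's specification.

$$
\bar{Y}_M - \bar{Y}_W
= \underbrace{(\bar{X}_M - \bar{X}_W)^\top \hat\beta_W}_{\text{differences in activity}}
+ \underbrace{\bar{X}_M^\top (\hat\beta_M - \hat\beta_W)}_{\text{differences in how activity is rewarded}}
$$

The identity holds because each fitted regression passes through its group means, so $\bar{Y}_M = \bar{X}_M^\top \hat\beta_M$ and $\bar{Y}_W = \bar{X}_W^\top \hat\beta_W$; expanding the right-hand side cancels the cross terms.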

arXiv:1811.00155 [pdf, other]

Low-Precision Random Fourier Features for Memory-Constrained Kernel Approximation

Authors: Jian Zhang, Avner May, Tri Dao, Christopher Ré

Abstract: We investigate how to train kernel approximation methods that generalize well under a memory budget. Building on recent theoretical work, we define a measure of kernel approximation error which we find to be more predictive of the empirical generalization performance of kernel approximation methods than conventional metrics. An important consequence of this definition is that a kernel approximation matrix must be high rank to attain close approximation. Because storing a high-rank approximation is memory intensive, we propose using a low-precision quantization of random Fourier features (LP-RFFs) to build a high-rank approximation under a memory budget. Theoretically, we show quantization has a negligible effect on generalization performance in important settings. Empirically, we demonstrate across four benchmark datasets that LP-RFFs can match the performance of full-precision RFFs and the Nyström method, with 3x-10x and 50x-460x less memory, respectively.

Submitted 20 March, 2019; v1 submitted 31 October, 2018; originally announced November 2018.

Comments: International Conference on Artificial Intelligence and Statistics (AISTATS) 2019

arXiv:1701.03577 [pdf, ps, other]

Kernel Approximation Methods for Speech Recognition

Authors: Avner May, Alireza Bagheri Garakani, Zhiyun Lu, Dong Guo, Kuan Liu, Aurélien Bellet, Linxi Fan, Michael Collins, Daniel Hsu, Brian Kingsbury, Michael Picheny, Fei Sha

Abstract: We study large-scale kernel methods for acoustic modeling in speech recognition and compare their performance to deep neural networks (DNNs). We perform experiments on four speech recognition datasets, including the TIMIT and Broadcast News benchmark tasks, and compare these two types of models on frame-level performance metrics (accuracy, cross-entropy), as well as on recognition metrics (word/character error rate). In order to scale kernel methods to these large datasets, we use the random Fourier feature method of Rahimi and Recht (2007). We propose two novel techniques for improving the performance of kernel acoustic models. First, in order to reduce the number of random features required by kernel models, we propose a simple but effective method for feature selection. The method is able to explore a large number of non-linear features while maintaining a compact model more efficiently than existing approaches. Second, we present a number of frame-level metrics which correlate very strongly with recognition performance when computed on the heldout set; we take advantage of these correlations by monitoring these metrics during training in order to decide when to stop learning. This technique can noticeably improve the recognition performance of both DNN and kernel models, while narrowing the gap between them. Additionally, we show that the linear bottleneck method of Sainath et al. (2013) improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Together, these three methods dramatically improve the performance of kernel acoustic models, making their performance comparable to DNNs on the tasks we explored.

Submitted 13 January, 2017; originally announced January 2017.

arXiv:1604.07163 [pdf, other]

Extreme-scale Multigrid Components within PETSc

Authors: Dave A. May, Patrick Sanan, Karl Rupp, Matthew G. Knepley, Barry F. Smith

Abstract: Elliptic partial differential equations (PDEs) frequently arise in continuum descriptions of physical processes relevant to science and engineering. Multilevel preconditioners represent a family of scalable techniques for solving discrete PDEs of this type and thus are the method of choice for high-resolution simulations. The scalability and time-to-solution of massively parallel multilevel preconditioners can be adversely affected by using a coarse-level solver with sub-optimal algorithmic complexity. To maintain scalability, agglomeration techniques applied to the coarse level have been shown to be necessary. In this work, we present a new software component introduced within the Portable Extensible Toolkit for Scientific computation (PETSc) which permits agglomeration. We provide an overview of the design and implementation of this functionality, together with several use cases highlighting the benefits of agglomeration. Lastly, we demonstrate, via numerical experiments employing geometric multigrid with structured meshes, the flexibility and performance gains possible using our MPI-rank agglomeration implementation.

Submitted 25 April, 2016; originally announced April 2016.

arXiv:1603.05800 [pdf, ps, other]

A Comparison between Deep Neural Nets and Kernel Acoustic Models for Speech Recognition

Authors: Zhiyun Lu, Dong Guo, Alireza Bagheri Garakani, Kuan Liu, Avner May, Aurelien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, Fei Sha

Abstract: We study large-scale kernel methods for acoustic modeling and compare to DNNs on performance metrics related to both acoustic modeling and recognition. Measuring perplexity and frame-level classification accuracy, kernel-based acoustic models are as effective as their DNN counterparts. However, on token-error-rates DNN models can be significantly better. We have discovered that this might be attributed to DNN's unique strength in reducing both the perplexity and the entropy of the predicted posterior probabilities. Motivated by our findings, we propose a new technique, entropy regularized perplexity, for model selection. This technique can noticeably improve the recognition performance of both types of models, and reduces the gap between them. While effective on Broadcast News, this technique could also be applicable to other tasks.

Submitted 18 March, 2016; originally announced March 2016.

Comments: arXiv admin note: text overlap with arXiv:1411.4000

arXiv:1411.4000 [pdf, other]

How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets

Authors: Zhiyun Lu, Avner May, Kuan Liu, Alireza Bagheri Garakani, Dong Guo, Aurélien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, Fei Sha

Abstract: The computational complexity of kernel methods has often been a major barrier for applying them to large-scale learning problems. We argue that this barrier can be effectively overcome. In particular, we develop methods to scale up kernel models to successfully tackle large-scale learning problems that are so far only approachable by deep learning architectures. Based on the seminal work by Rahimi and Recht on approximating kernel functions with features derived from random projections, we advance the state-of-the-art by proposing methods that can efficiently train models with hundreds of millions of parameters, and learn optimal representations from multiple kernels. We conduct extensive empirical studies on problems from image recognition and automatic speech recognition, and show that the performance of our kernel models matches that of well-engineered deep neural nets (DNNs). To the best of our knowledge, this is the first time that a direct comparison between these two methods on large-scale problems is reported. Our kernel methods have several appealing properties: training with convex optimization, cost for training a single model comparable to DNNs, and significantly reduced total cost due to fewer hyperparameters to tune for model selection. Our contrastive study between these two very different but equally competitive models sheds light on fundamental questions such as how to learn good representations.

Submitted 17 June, 2015; v1 submitted 14 November, 2014; originally announced November 2014.

arXiv:1107.5951 [pdf, other]

doi 10.1111/j.1365-246X.2011.05167.x

Optimal, scalable forward models for computing gravity anomalies

Authors: Dave A. May, Matthew G. Knepley

Abstract: We describe three approaches for computing a gravity signal from a density anomaly. The first approach consists of the classical "summation" technique, whilst the remaining two methods solve the Poisson problem for the gravitational potential using either a Finite Element (FE) discretization employing a multilevel preconditioner, or a Green's function evaluated with the Fast Multipole Method (FMM). The methods utilizing the PDE formulation described here differ from previously published approaches used in gravity modeling in that they are optimal, implying that both the memory and computational time required scale linearly with respect to the number of unknowns in the potential field. Additionally, all of the implementations presented here are developed such that the computations can be performed in a massively parallel, distributed memory computing environment. Through numerical experiments, we compare the methods on the basis of their discretization error, CPU time and parallel scalability. We demonstrate the parallel scalability of all these techniques by running forward models with up to $10^8$ voxels on thousands of cores.

Submitted 29 July, 2011; originally announced July 2011.

Comments: 38 pages, 13 figures; accepted by Geophysical Journal International

Journal ref: Geophysical Journal International, 187(1):161-177, 2011

Showing 1–20 of 20 results for author: May, A