Zum Hauptinhalt springen

Showing 1–10 of 10 results for author: Aminabadi, R Y

Searching in archive cs. Search in all archives.
.
  1. arXiv:2401.08671  [pdf, other

    cs.PF cs.LG

    DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

    Authors: Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

    Abstract: The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation compos… ▽ More

    Submitted 9 January, 2024; originally announced January 2024.

  2. arXiv:2312.08583  [pdf, other

    cs.CL stat.ML

    ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks

    Authors: Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao

    Abstract: This study examines 4-bit quantization methods like GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited enhancement in Zero-Shot tasks. While prior works merely focusing on zero-shot measurement, we extend task scope to more generative categories such as code generation and abstractive summarization, in which we found that INT4 quantization can significantly underperf… ▽ More

    Submitted 18 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

  3. arXiv:2310.17723  [pdf, other

    cs.LG cs.CL

    ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers

    Authors: Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He

    Abstract: Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference. Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization. Addressing these gaps, we present a novel, fully hardware-enhanced robust optimized post-tra… ▽ More

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: 8 pages, 2 figures

  4. arXiv:2308.01320  [pdf, other

    cs.LG cs.AI cs.CL

    DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales

    Authors: Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He

    Abstract: ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement Learning with Human Feedback) training pipeline for these powerful models, particularly when training at… ▽ More

    Submitted 2 August, 2023; originally announced August 2023.

    Comments: 14 pages, 7 figures

  5. arXiv:2301.12017  [pdf, other

    cs.CL cs.LG

    Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases

    Authors: Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He

    Abstract: Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and latency while preserving model accuracy, it remains unclear whether we can leverage INT4 (which doubles peak hardware throughput) to achieve further latency im… ▽ More

    Submitted 30 May, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

    Journal ref: Fortieth International Conference on Machine Learning 2023

  6. arXiv:2207.00032  [pdf, other

    cs.LG cs.DC cs.PF

    DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

    Authors: Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He

    Abstract: The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically with the largest being of hundred-billion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts;… ▽ More

    Submitted 30 June, 2022; originally announced July 2022.

  7. arXiv:2206.01861  [pdf, other

    cs.CL cs.LG

    ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

    Authors: Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

    Abstract: How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as ZeroQuant. ZeroQuant is an end-to-end quantizatio… ▽ More

    Submitted 3 June, 2022; originally announced June 2022.

    Comments: 11 pages, 4 figures

  8. arXiv:2201.11990  [pdf, other

    cs.CL

    Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

    Authors: Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro

    Abstract: Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models.… ▽ More

    Submitted 4 February, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: Shaden Smith and Mostofa Patwary contributed equally

  9. arXiv:2201.05596  [pdf, other

    cs.LG cs.AI cs.DC

    DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

    Authors: Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He

    Abstract: As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Its training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x savin… ▽ More

    Submitted 21 July, 2022; v1 submitted 14 January, 2022; originally announced January 2022.

    Comments: This paper is published at ICML 2022: https://proceedings.mlr.press/v162/rajbhandari22a

  10. arXiv:2101.06840  [pdf, other

    cs.DC cs.LG

    ZeRO-Offload: Democratizing Billion-Scale Model Training

    Authors: Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He

    Abstract: Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular framework s… ▽ More

    Submitted 17 January, 2021; originally announced January 2021.