Showing 1–20 of 20 results for author: Kuchaiev, O

Searching in archive cs.

  1. arXiv:2407.04528  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning

    Authors: Aleksander Ficek, Jiaqi Zeng, Oleksii Kuchaiev

    Abstract: Parameter-Efficient Fine-Tuning (PEFT) and Retrieval-Augmented Generation (RAG) have become popular methods for adapting large language models while minimizing compute requirements. In this paper, we apply PEFT methods (P-tuning, Adapters, and LoRA) to a modified Retrieval-Enhanced Transformer (RETRO) and a baseline GPT model across several sizes, ranging from 823 million to 48 billion parameters.…

    Submitted 5 July, 2024; originally announced July 2024.

  2. arXiv:2406.11704  [pdf, other]

    cs.CL cs.AI cs.LG

    Nemotron-4 340B Technical Report

    Authors: Nvidia: Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, et al. (58 additional authors not shown)

    Abstract: We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation be…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2406.08673  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    HelpSteer2: Open-source dataset for training top-performing reward models

    Authors: Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev

    Abstract: High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer need to be updated to remain effective for reward modeling. Methods…

    Submitted 12 June, 2024; originally announced June 2024.

  4. arXiv:2405.01481  [pdf, other]

    cs.CL cs.AI cs.LG

    NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment

    Authors: Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, Markel Sanz Ausin, Ashwath Aithal, Oleksii Kuchaiev

    Abstract: Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs which often contain tens or hundreds of billions of parameters. We create NeMo-Aligner, a toolkit for model alignment that can efficiently scale to using h…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 13 pages, 4 figures

  5. arXiv:2402.16819  [pdf, other]

    cs.CL cs.AI cs.LG

    Nemotron-4 15B Technical Report

    Authors: Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, et al. (2 additional authors not shown)

    Abstract: We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remai…

    Submitted 27 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  6. arXiv:2311.09578  [pdf, other]

    cs.CL cs.AI cs.LG

    Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying

    Authors: Adithya Renduchintala, Tugrul Konuk, Oleksii Kuchaiev

    Abstract: We introduce Tied-LoRA, a novel paradigm leveraging weight tying and selective training to enhance the parameter efficiency of Low-rank Adaptation (LoRA). Our exploration encompasses different plausible combinations of parameter training and freezing, coupled with weight tying, aimed at identifying the optimal trade-off between performance and the count of trainable parameters. Across 5 diverse…

    Submitted 12 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: 8 pages 4 figures

  7. arXiv:2311.09528  [pdf, other]

    cs.CL cs.AI cs.LG

    HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM

    Authors: Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, Oleksii Kuchaiev

    Abstract: Existing open-source helpfulness preference datasets do not specify what makes some responses more helpful and others less so. Models trained on these datasets can incidentally learn to model dataset artifacts (e.g. preferring longer but unhelpful responses only due to their length). To alleviate this problem, we collect HelpSteer, a multi-attribute helpfulness dataset annotated for the various as…

    Submitted 15 November, 2023; originally announced November 2023.

  8. arXiv:2310.05344  [pdf, other]

    cs.CL cs.AI cs.LG

    SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF

    Authors: Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, Xianchao Wu, Oleksii Kuchaiev

    Abstract: Model alignment with human preferences is an essential step in making Large Language Models (LLMs) helpful and consistent with human values. It typically consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from a complex training setup and its tendency to align the model with implicit values that e…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Findings of EMNLP 2023

  9. arXiv:2305.06155  [pdf, other]

    cs.CL cs.AI cs.LG

    Leveraging Synthetic Targets for Machine Translation

    Authors: Sarthak Mittal, Oleksii Hrinchuk, Oleksii Kuchaiev

    Abstract: In this work, we provide a recipe for training machine translation models in a limited resource setting by leveraging synthetic target data generated using a large pre-trained model. We show that consistently across different benchmarks in bilingual, multilingual, and speech translation setups, training models on synthetic targets outperforms training on the actual ground-truth data. This performa…

    Submitted 7 May, 2023; originally announced May 2023.

  10. arXiv:2304.06762  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study

    Authors: Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar, Bryan Catanzaro

    Abstract: Large decoder-only language models (LMs) can be largely improved in terms of perplexity by retrieval (e.g., RETRO), but its impact on text generation quality and downstream task accuracy is unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study on a scalable pre-trained retrieval-augmented LM (i.e., RET…

    Submitted 20 December, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: EMNLP 2023

  11. arXiv:2206.01137  [pdf, other]

    cs.CL cs.LG

    Finding the Right Recipe for Low Resource Domain Adaptation in Neural Machine Translation

    Authors: Virginia Adams, Sandeep Subramanian, Mike Chrzanowski, Oleksii Hrinchuk, Oleksii Kuchaiev

    Abstract: General translation models often still struggle to generate accurate translations in specialized domains. To guide machine translation practitioners and characterize the effectiveness of domain adaptation methods under different data availability scenarios, we conduct an in-depth empirical exploration of monolingual and parallel data approaches to domain adaptation of pre-trained, third-party, NMT…

    Submitted 2 June, 2022; originally announced June 2022.

  12. arXiv:2111.08634  [pdf, other]

    cs.CL cs.LG

    NVIDIA NeMo Neural Machine Translation Systems for English-German and English-Russian News and Biomedical Tasks at WMT21

    Authors: Sandeep Subramanian, Oleksii Hrinchuk, Virginia Adams, Oleksii Kuchaiev

    Abstract: This paper provides an overview of NVIDIA NeMo's neural machine translation systems for the constrained data track of the WMT21 News and Biomedical Shared Translation Tasks. Our news task submissions for English-German (En-De) and English-Russian (En-Ru) are built on top of a baseline transformer-based sequence-to-sequence model. Specifically, we use a combination of 1) checkpoint averaging 2) mod…

    Submitted 16 November, 2021; originally announced November 2021.

    Comments: WMT'21 news and biomedical shared task submission

  13. arXiv:2104.02014  [pdf, other]

    cs.CL eess.AS

    SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

    Authors: Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko

    Abstract: In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present…

    Submitted 6 April, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: 5 pages, 1 figure. Submitted to INTERSPEECH 2021

  14. arXiv:1909.09577  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    NeMo: a toolkit for building AI applications using Neural Modules

    Authors: Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, Jonathan M. Cohen

    Abstract: NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations…

    Submitted 13 September, 2019; originally announced September 2019.

    Comments: 6 pages plus references

  15. arXiv:1905.11286  [pdf, other]

    cs.LG stat.ML

    Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

    Authors: Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen

    Abstract: We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par or better than well tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of l…

    Submitted 6 February, 2020; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: Preprint, under review

  16. arXiv:1904.03288  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Jasper: An End-to-End Convolutional Neural Acoustic Model

    Authors: Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, Ravi Teja Gadde

    Abstract: In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep arc…

    Submitted 26 August, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Accepted to INTERSPEECH 2019

  17. arXiv:1805.10387  [pdf, other]

    cs.CL

    Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq

    Authors: Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Jason Li, Huyen Nguyen, Carl Case, Paulius Micikevicius

    Abstract: We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training. Benchmarks on machine translation and speech recognition tasks show that models built using OpenSeq2Seq give state-of-the-art performance at 1.5-3x less training time. OpenSeq2Seq currently provides building blocks for models that solve a wide range o…

    Submitted 21 November, 2018; v1 submitted 25 May, 2018; originally announced May 2018.

    Comments: Presented at Workshop for Natural Language Processing Open Source Software (NLP-OSS), co-located with ACL2018

  18. arXiv:1710.03740  [pdf, other]

    cs.AI cs.LG stat.ML

    Mixed Precision Training

    Authors: Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu

    Abstract: Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. We introduce a technique to train deep neural networks using half precision floating point numbers. In our technique, weights, activations and g…

    Submitted 15 February, 2018; v1 submitted 10 October, 2017; originally announced October 2017.

    Comments: Published as a conference paper at ICLR 2018

  19. arXiv:1708.01715  [pdf, other]

    stat.ML cs.LG

    Training Deep AutoEncoders for Collaborative Filtering

    Authors: Oleksii Kuchaiev, Boris Ginsburg

    Abstract: This paper proposes a novel model for the rating prediction task in recommender systems which significantly outperforms previous state-of-the-art models on a time-split Netflix data set. Our model is based on deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training. We empirically demonstrate that: a) deep autoencoder models generalize much better than the shall…

    Submitted 10 October, 2017; v1 submitted 5 August, 2017; originally announced August 2017.

    Comments: 5 pages, 6 figures

  20. arXiv:1703.10722  [pdf, other]

    cs.CL cs.NE stat.ML

    Factorization tricks for LSTM networks

    Authors: Oleksii Kuchaiev, Boris Ginsburg

    Abstract: We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is "matrix factorization by design" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups. Both approaches allow us to train large LSTM net…

    Submitted 24 February, 2018; v1 submitted 30 March, 2017; originally announced March 2017.

    Comments: accepted to ICLR 2017 Workshop