Showing 1–16 of 16 results for author: van Baalen, M

Searching in archive cs.
  1. arXiv:2407.16712

    cs.LG

    Rapid Switching and Multi-Adapter Fusion via Sparse High Rank Adapters

    Authors: Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

    Abstract: In this paper, we propose Sparse High Rank Adapters (SHiRA) that directly finetune 1-2% of the base model weights while leaving others unchanged, thus, resulting in a highly sparse adapter. This high sparsity incurs no inference overhead, enables rapid switching directly in the fused mode, and significantly reduces concept-loss during multi-adapter fusion. Our extensive experiments on LVMs and LLM…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Published at ICML 2024 Workshop on Foundation Models in the Wild. arXiv admin note: substantial text overlap with arXiv:2406.13175

  2. arXiv:2406.13175

    cs.LG cs.AI

    Sparse High Rank Adapters

    Authors: Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

    Abstract: Low Rank Adaptation (LoRA) has gained massive attention in the recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30%…

    Submitted 18 June, 2024; originally announced June 2024.

  3. arXiv:2402.15319

    cs.LG cs.CL

    GPTVQ: The Blessing of Dimensionality for LLM Quantization

    Authors: Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough

    Abstract: In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining un…

    Submitted 23 February, 2024; originally announced February 2024.

  4. arXiv:2312.17244

    cs.LG cs.CL

    The LLM Surgeon

    Authors: Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort

    Abstract: State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative…

    Submitted 20 March, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

  5. arXiv:2307.04535

    cs.LG cs.AI cs.CV

    QBitOpt: Fast and Accurate Bitwidth Reallocation during Training

    Authors: Jorn Peters, Marios Fournarakis, Markus Nagel, Mart van Baalen, Tijmen Blankevoort

    Abstract: Quantizing neural networks is one of the most effective methods for achieving efficient inference on mobile and embedded devices. In particular, mixed precision quantized (MPQ) networks, whose layers can be quantized to different bitwidths, achieve better task performance for the same resource constraint compared to networks with homogeneous bitwidths. However, finding the optimal bitwidth allocat…

    Submitted 10 July, 2023; originally announced July 2023.

  6. arXiv:2307.02973

    cs.LG

    Pruning vs Quantization: Which is Better?

    Authors: Andrey Kuzmin, Markus Nagel, Mart van Baalen, Arash Behboodi, Tijmen Blankevoort

    Abstract: Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question on which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We…

    Submitted 16 February, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

  7. arXiv:2303.17951

    cs.LG

    FP8 versus INT8 for efficient deep learning inference

    Authors: Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort

    Abstract: Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed-precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive t…

    Submitted 15 June, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

  8. arXiv:2302.05397

    cs.LG

    A Practical Mixed Precision Algorithm for Post-Training Quantization

    Authors: Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag Patel, Tijmen Blankevoort

    Abstract: Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer gets quantized to the same number of bits. However, for many networks some layers are significantly more robust to quantization noise than others, leaving an important axi…

    Submitted 10 February, 2023; originally announced February 2023.

  9. arXiv:2208.09225

    cs.LG

    FP8 Quantization: The Power of the Exponent

    Authors: Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

    Abstract: When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper in-depth investigates this benefit of the floating point format for neural network inference. We detail the choices that can be made for the FP8…

    Submitted 23 February, 2024; v1 submitted 19 August, 2022; originally announced August 2022.

  10. arXiv:2207.11048

    cs.LG

    Quantized Sparse Weight Decomposition for Neural Network Compression

    Authors: Andrey Kuzmin, Mart van Baalen, Markus Nagel, Arash Behboodi

    Abstract: In this paper, we introduce a novel method of neural network weight compression. In our method, we store weight tensors as sparse, quantized matrix factors, whose product is computed on the fly during inference to generate the target model's weights. We use projected gradient descent methods to find quantized and sparse factorization of the weight tensors. We show that this approach can be seen as…

    Submitted 22 July, 2022; originally announced July 2022.

  11. arXiv:2202.01290

    cs.LG cs.CV

    Cyclical Pruning for Sparse Neural Networks

    Authors: Suraj Srinivas, Andrey Kuzmin, Markus Nagel, Mart van Baalen, Andrii Skliar, Tijmen Blankevoort

    Abstract: Current methods for pruning neural network weights iteratively apply magnitude-based pruning on the model weights and re-train the resulting model to recover lost accuracy. In this work, we show that such strategies do not allow for the recovery of erroneously pruned weights. To enable weight recovery, we propose a simple strategy called cyclical pruning which requires the pruning schedul…

    Submitted 2 February, 2022; originally announced February 2022.

  12. arXiv:2106.08295

    cs.LG cs.AI cs.CV

    A White Paper on Neural Network Quantization

    Authors: Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, Tijmen Blankevoort

    Abstract: While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings but the additional noise…

    Submitted 15 June, 2021; originally announced June 2021.

  13. arXiv:2005.07093

    cs.LG cs.CV stat.ML

    Bayesian Bits: Unifying Quantization and Pruning

    Authors: Mart van Baalen, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, Max Welling

    Abstract: We introduce Bayesian Bits, a practical method for joint mixed precision quantization and pruning through gradient based optimization. Bayesian Bits employs a novel decomposition of the quantization operation, which sequentially considers doubling the bit width. At each new bit width, the residual error between the full precision value and the previously rounded value is quantized. We then decide…

    Submitted 27 October, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

  14. arXiv:2004.10568

    cs.LG cs.CV stat.ML

    Up or Down? Adaptive Rounding for Post-Training Quantization

    Authors: Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort

    Abstract: When quantizing neural networks, assigning each floating-point weight to its nearest fixed-point value is the predominant approach. We find that, perhaps surprisingly, this is not the best we can do. In this paper, we propose AdaRound, a better weight-rounding mechanism for post-training quantization that adapts to the data and the task loss. AdaRound is fast, does not require fine-tuning of the n…

    Submitted 30 June, 2020; v1 submitted 22 April, 2020; originally announced April 2020.

    Comments: Published as a conference paper at ICML 2020

  15. arXiv:2002.07520

    cs.LG stat.ML

    Gradient $\ell_1$ Regularization for Quantization Robustness

    Authors: Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, Max Welling

    Abstract: We analyze the effect of quantizing weights and activations of neural networks on their loss and derive a simple regularization scheme that improves robustness against post-training quantization. By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths as energy and memory requirements of the application c…

    Submitted 18 February, 2020; originally announced February 2020.

    Comments: ICLR 2020

  16. arXiv:1906.04721

    cs.LG cs.CV stat.ML

    Data-Free Quantization Through Weight Equalization and Bias Correction

    Authors: Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling

    Abstract: We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quantization is essential for efficient inference on modern deep learning hardware. However, quantizing models to run in 8-bit is a non-trivial task, freq…

    Submitted 25 November, 2019; v1 submitted 11 June, 2019; originally announced June 2019.

    Comments: ICCV 2019

    Journal ref: The IEEE International Conference on Computer Vision (ICCV), 2019