Showing 1–6 of 6 results for author: Pandey, N P

Searching in archive cs.
  1. arXiv:2407.16712

    cs.LG

    Rapid Switching and Multi-Adapter Fusion via Sparse High Rank Adapters

    Authors: Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

    Abstract: In this paper, we propose Sparse High Rank Adapters (SHiRA) that directly finetune 1-2% of the base model weights while leaving others unchanged, thus, resulting in a highly sparse adapter. This high sparsity incurs no inference overhead, enables rapid switching directly in the fused mode, and significantly reduces concept-loss during multi-adapter fusion. Our extensive experiments on LVMs and LLM…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Published at ICML 2024 Workshop on Foundation Models in the Wild. arXiv admin note: substantial text overlap with arXiv:2406.13175

  2. arXiv:2406.13175

    cs.LG cs.AI

    Sparse High Rank Adapters

    Authors: Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

    Abstract: Low Rank Adaptation (LoRA) has gained massive attention in the recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30%…

    Submitted 18 June, 2024; originally announced June 2024.

  3. arXiv:2406.08798

    cs.CV

    FouRA: Fourier Low Rank Adaptation

    Authors: Shubhankar Borse, Shreya Kadambi, Nilesh Prasad Pandey, Kartikeya Bhardwaj, Viswanath Ganapathy, Sweta Priyadarshi, Risheek Garrepalli, Rafael Esteves, Munawar Hayat, Fatih Porikli

    Abstract: While Low-Rank Adaptation (LoRA) has proven beneficial for efficiently fine-tuning large models, LoRA fine-tuned text-to-image diffusion models lack diversity in the generated images, as the model tends to copy data from the observed training samples. This effect becomes more pronounced at higher values of adapter strength and for adapters with higher ranks which are fine-tuned on smaller datasets…

    Submitted 13 June, 2024; originally announced June 2024.

  4. arXiv:2403.18159

    cs.LG cs.AI cs.CL

    Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models

    Authors: Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Kyunggeun Lee, Jun Ma, Harris Teague

    Abstract: Large generative models such as large language models (LLMs) and diffusion models have revolutionized the fields of NLP and computer vision respectively. However, their slow inference and high computation and memory requirements make it challenging to deploy them on edge devices. In this study, we propose a light-weight quantization aware fine tuning technique using knowledge distillation (KD-QAT) to…

    Submitted 28 March, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted at Practical ML for Low Resource Settings Workshop at ICLR 2024

  5. arXiv:2309.01729

    cs.LG cs.AI cs.CV

    Softmax Bias Correction for Quantized Generative Models

    Authors: Nilesh Prasad Pandey, Marios Fournarakis, Chirag Patel, Markus Nagel

    Abstract: Post-training quantization (PTQ) is the go-to compression technique for large generative models, such as stable diffusion or large language models. PTQ methods commonly keep the softmax activation in higher precision as it has been shown to be very sensitive to quantization noise. However, this can lead to a significant runtime and power overhead during inference on resource-constrained edge device…

    Submitted 4 September, 2023; originally announced September 2023.

  6. arXiv:2302.05397

    cs.LG

    A Practical Mixed Precision Algorithm for Post-Training Quantization

    Authors: Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag Patel, Tijmen Blankevoort

    Abstract: Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer gets quantized to the same number of bits. However, for many networks some layers are significantly more robust to quantization noise than others, leaving an important axi…

    Submitted 10 February, 2023; originally announced February 2023.