
Showing 1–3 of 3 results for author: Duanmu, H

Searching in archive cs.
  1. arXiv:2405.06219  [pdf, other]

    cs.LG cs.CL

    SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

    Authors: Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin

    Abstract: Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks like book understanding and generating lengthy novels. However, the key-value (KV) cache required for LLMs consumes substantial memory as context length increases, becoming the bottleneck for deployment. In this paper, we present a strategy called SKVQ, which stands for sliding-window KV cache quantizat…

    Submitted 13 May, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

  2. arXiv:2404.02015  [pdf, other]

    cs.DC

    MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

    Authors: Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang

    Abstract: Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to varying popularity of LLMs. In the paper, we present MuxServe, a flexible spatial-temporal multiplexing…

    Submitted 12 June, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

  3. arXiv:2402.12065  [pdf, other]

    cs.LG cs.AI cs.CL

    WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

    Authors: Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

    Abstract: Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers. We critically analyz…

    Submitted 20 February, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: First work to exclusively quantize weight and Key/Value cache for large language models