Showing 1–13 of 13 results for author: Höper, C

Searching in archive cs.
  1. arXiv:2409.00608

    cs.CL cs.LG

    TinyAgent: Function Calling at the Edge

    Authors: Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

    Abstract: Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present…

    Submitted 1 September, 2024; originally announced September 2024.

  2. arXiv:2407.13463

    cs.CL cs.AI

    End-To-End Clinical Trial Matching with Large Language Models

    Authors: Dyke Ferber, Lars Hilgers, Isabella C. Wiest, Marie-Elisabeth Leßmann, Jan Clusmann, Peter Neidlinger, Jiefu Zhu, Georg Wölflein, Jacqueline Lammert, Maximilian Tschochohei, Heiko Böhme, Dirk Jäger, Mihaela Aldea, Daniel Truhn, Christiane Höper, Jakob Nikolas Kather

    Abstract: Matching cancer patients to clinical trials is essential for advancing treatment and patient care. However, the inconsistent format of medical free text documents and complex trial eligibility criteria make this process extremely challenging and time-consuming for physicians. We investigated whether the entire trial matching process - from identifying relevant trials among 105,600 oncology-related…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 149 pages, including Supplements. 3 Main Figures

  3. arXiv:2403.14123

    cs.LG cs.AR cs.DC

    AI and Memory Wall

    Authors: Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer

    Abstract: The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/2yrs, outpacing the growth of DRAM and…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: Published in IEEE Micro Journal

  4. arXiv:2401.18079

    cs.LG

    KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

    Authors: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

    Abstract: LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurat…

    Submitted 4 July, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

  5. arXiv:2401.07886

    cs.LG cs.AI cs.CL cs.DC

    Learned Best-Effort LLM Serving

    Authors: Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer

    Abstract: Many applications must provide low-latency LLM service to users or risk unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Our best-effort syst…

    Submitted 14 July, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

    Comments: Es-FoMo @ ICML 2024

  6. arXiv:2311.03285

    cs.LG cs.AI cs.DC

    S-LoRA: Serving Thousands of Concurrent LoRA Adapters

    Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica

    Abstract: The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched in…

    Submitted 5 June, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

  7. arXiv:2310.12371

    eess.AS cs.SD

    Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation

    Authors: Tae Jin Park, He Huang, Coleman Hooper, Nithin Koluguri, Kunal Dhawan, Ante Jukic, Jagadeesh Balam, Boris Ginsburg

    Abstract: We introduce a sophisticated multi-speaker speech data simulator, specifically engineered to generate multi-speaker speech recordings. A notable feature of this simulator is its capacity to modulate the distribution of silence and overlap via the adjustment of statistical parameters. This capability offers a tailored training environment for developing neural models suited for speaker diarization…

    Submitted 18 October, 2023; originally announced October 2023.

    Journal ref: CHiME-7 Workshop 2023

  8. arXiv:2310.12072

    cs.CL

    SPEED: Speculative Pipelined Execution for Efficient Decoding

    Authors: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

    Abstract: Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nat…

    Submitted 2 January, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

    Comments: NeurIPS Workshop on Efficient Natural Language and Speech Processing (2023)

  9. arXiv:2306.07629

    cs.CL cs.LG

    SqueezeLLM: Dense-and-Sparse Quantization

    Authors: Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

    Abstract: Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models.…

    Submitted 4 June, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: ICML 2024

  10. arXiv:2302.14017

    cs.CL cs.LG

    Full Stack Optimization of Transformer Inference: a Survey

    Authors: Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami

    Abstract: Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing…

    Submitted 27 February, 2023; originally announced February 2023.

    Journal ref: Presented in Workshop on Architecture and System Support for Transformer Models (ASSYST) at ISCA 2023

  11. arXiv:2105.01134

    eess.AS cs.SD

    Quantifying and Maximizing the Benefits of Back-End Noise Adaption on Attention-Based Speech Recognition Models

    Authors: Coleman Hooper, Thierry Tambe, Gu-Yeon Wei

    Abstract: This work analyzes how attention-based Bidirectional Long Short-Term Memory (BLSTM) models adapt to noise-augmented speech. We identify crucial components for noise adaptation in BLSTM models by freezing model components during fine-tuning. We first freeze larger model subnetworks and then pursue a fine-grained freezing approach in the encoder after identifying its importance for noise adaptation.…

    Submitted 23 September, 2021; v1 submitted 3 May, 2021; originally announced May 2021.

    Comments: Submitted to ENLSP 2021

  12. arXiv:2011.14203

    cs.AR cs.CL

    EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

    Authors: Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul N. Whatmough, Alexander M. Rush, David Brooks, Gu-Yeon Wei

    Abstract: Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimi…

    Submitted 5 September, 2021; v1 submitted 28 November, 2020; originally announced November 2020.

    Comments: 12 pages plus references. Paper to appear at the 54th IEEE/ACM International Symposium on Microarchitecture (MICRO 2021)

  13. arXiv:1307.0516

    cs.SI nlin.AO physics.soc-ph q-bio.PE

    Dynamical Structure of a Traditional Amazonian Social Network

    Authors: Paul L. Hooper, Simon DeDeo, Ann E. Caldwell Hooper, Michael Gurven, Hillard S. Kaplan

    Abstract: Reciprocity is a vital feature of social networks, but relatively little is known about its temporal structure or the mechanisms underlying its persistence in real world behavior. In pursuit of these two questions, we study the stationary and dynamical signals of reciprocity in a network of manioc beer (Spanish: chicha; Tsimane': shocdye') drinking events in a Tsimane' village in lowland Bolivia.…

    Submitted 17 November, 2013; v1 submitted 1 July, 2013; originally announced July 2013.

    Comments: 24 pages, 6 figures, 1 table; expanded results and discussion sections; matches published version

    Journal ref: Entropy 2013, 15, 4932-4955