Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

Li, Luchang; Qian, Sheng; Lu, Jie; Yuan, Lunxi; Wang, Rui; Xie, Qin

arXiv:2403.20041 (cs)

[Submitted on 29 Mar 2024 (v1), last revised 5 Jul 2024 (this version, v3)]

Abstract: Large language models (LLMs) are widely used on mobile phones for tasks such as intelligent assistants, text summarization, translation, and multi-modality. However, current methods for on-device LLM deployment suffer from slow inference, which leads to a poor user experience. To enable high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbolic expression-based approach to support dynamic-shape model inference; (b) operator optimizations and execution priority setting to improve inference speed and reduce phone lagging; (c) an FP4 quantization method termed M0E4 to reduce dequantization overhead; and (d) a sub-tensor-based technique that eliminates the need to copy the KV cache after LLM inference. We implement these methods in our mobile inference engine, Transformer-Lite, which is compatible with both Qualcomm and MTK processors. We evaluated Transformer-Lite using LLMs with varied architectures and parameter counts ranging from 2B to 14B. Specifically, we achieved prefill and decoding speeds of 121 tokens/s and 14 tokens/s, respectively, for ChatGLM2 6B, and 330 tokens/s and 30 tokens/s, respectively, for the smaller Gemma 2B. Compared with the CPU-based FastLLM and the GPU-based MLC-LLM, our engine attains over 10x speedup in prefill speed and a 2x to 3x speedup in decoding speed.
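Technique (d) can be illustrated with a rough sketch. The abstract does not describe how Transformer-Lite implements it, so the code below is only an assumption about the general idea: allocate the KV cache once at its maximum size, write each new token's vector into the next slot (a sub-tensor) of that buffer, and expose the valid prefix as a zero-copy view. The class name `PreallocatedKVCache` and its methods are hypothetical, and a flat CPU buffer of floats stands in for a GPU tensor.

```python
from array import array
from typing import Sequence


class PreallocatedKVCache:
    """Hypothetical single-head cache held in one fixed buffer (illustration only)."""

    def __init__(self, max_tokens: int, head_dim: int) -> None:
        self.max_tokens = max_tokens
        self.head_dim = head_dim
        self.length = 0  # number of tokens currently cached
        # One allocation up front; the buffer is never resized afterwards.
        self._buf = array("f", [0.0] * (max_tokens * head_dim))
        self._view = memoryview(self._buf)

    def append(self, vector: Sequence[float]) -> None:
        """Write one token's vector into the next free slot of the buffer."""
        if len(vector) != self.head_dim:
            raise ValueError("vector has the wrong dimension")
        if self.length >= self.max_tokens:
            raise OverflowError("cache is full")
        start = self.length * self.head_dim
        # Slice assignment writes in place; earlier entries are not moved.
        self._view[start:start + self.head_dim] = array("f", vector)
        self.length += 1

    def valid(self) -> memoryview:
        """Return a zero-copy view over the tokens written so far."""
        return self._view[: self.length * self.head_dim]
```

Calling `append` once per decoded token and reading `valid()` before each attention step never moves the earlier entries, in contrast to a design that concatenates the old cache with the new token into a fresh tensor on every step.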

Comments:	21 pages, 6 figures, fix "E0M4" spell mistake, fix FLOPS to TFLOPS
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2403.20041 [cs.CL]
	(or arXiv:2403.20041v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.20041

Submission history

From: Luchang Li
[v1] Fri, 29 Mar 2024 08:26:53 UTC (654 KB)
[v2] Tue, 21 May 2024 01:21:19 UTC (654 KB)
[v3] Fri, 5 Jul 2024 07:18:42 UTC (654 KB)
