-
Exploring Context Window of Large Language Models via Decomposed Positional Vectors
Authors:
Zican Dong,
Junyi Li,
Xin Men,
Wayne Xin Zhao,
Bingbing Wang,
Zhen Tian,
Weipeng Chen,
Ji-Rong Wen
Abstract:
Transformer-based large language models (LLMs) typically have a limited context window, resulting in significant performance degradation when processing text beyond the length of the context window. Many methods have been proposed to extend the context window and achieve length extrapolation in LLMs, but these approaches still lack in-depth interpretation. In this study, we explore the positional information within and beyond the context window for deciphering the underlying mechanism of LLMs. By using a mean-based decomposition method, we disentangle positional vectors from hidden states of LLMs and analyze their formation and effect on attention. Furthermore, when texts exceed the context window, we analyze the change of positional vectors in two settings, i.e., direct extrapolation and context window extension. Based on our findings, we design two training-free context window extension methods, positional vector replacement and attention window extension. Experimental results show that our methods can effectively extend the context window length.
Submitted 28 May, 2024;
originally announced May 2024.
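The abstract above does not spell out the mean-based decomposition, so the following is only a minimal sketch of one plausible reading: the positional vector at each position is taken as the mean of the hidden states at that position across samples, and the remainder is treated as the content-dependent part. The function name `decompose` and the nested-list layout `hidden[s][t]` are assumptions made for illustration, not names from the paper.

```python
def decompose(hidden):
    """Split hidden states into a per-position mean and a per-sample residual.

    hidden[s][t] is a list of floats: the hidden state of sample s at
    position t (hypothetical layout, one layer only).
    """
    n_samples = len(hidden)
    n_pos = len(hidden[0])
    dim = len(hidden[0][0])

    # Assumed "positional vector": mean over samples at each position.
    positional = [
        [sum(hidden[s][t][k] for s in range(n_samples)) / n_samples
         for k in range(dim)]
        for t in range(n_pos)
    ]

    # Residual: what is left after subtracting the positional vector.
    residual = [
        [[hidden[s][t][k] - positional[t][k] for k in range(dim)]
         for t in range(n_pos)]
        for s in range(n_samples)
    ]
    return positional, residual


# Toy input: 2 samples, 2 positions, 2 dimensions.
hidden = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 0.0], [5.0, 8.0]],
]
positional, residual = decompose(hidden)
```

By construction the residuals at each position sum to zero across samples, so under this reading any component shared by all samples at a given position ends up in the positional vector.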
-
Base of RoPE Bounds Context Length
Authors:
Xin Men,
Mingyu Xu,
Bingning Wang,
Qingyu Zhang,
Hongyu Lin,
Xianpei Han,
Weipeng Chen
Abstract:
Position embedding is a core component of current Large Language Models (LLMs). Rotary position embedding (RoPE), a technique that encodes the position information with a rotation matrix, has been the de facto choice for position embedding in many LLMs, such as the Llama series. RoPE has been further utilized to extend long context capability, which is roughly based on adjusting the base parameter of RoPE to mitigate out-of-distribution (OOD) problems in position embedding. However, in this paper, we find that LLMs may obtain a superficial long-context ability based on the OOD theory. We revisit the role of RoPE in LLMs and propose a novel property of long-term decay, from which we derive that the base of RoPE bounds context length: there is an absolute lower bound for the base value to obtain certain context length capability. Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
Submitted 23 May, 2024;
originally announced May 2024.
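To make the role of the base parameter concrete, here is a small sketch of the standard RoPE formulation, in which dimension pair i rotates at frequency base^(-2i/d). It only shows how a larger base stretches the rotation wavelengths; it does not reproduce the lower bound derived in the paper. The helper names `rope_wavelengths` and `rotate_pair` are hypothetical.

```python
import math


def rope_wavelengths(dim, base):
    """Wavelength (in tokens) of each rotated pair: 2*pi / theta_i,
    with theta_i = base ** (-2 * i / dim) for i in 0 .. dim/2 - 1."""
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]


def rotate_pair(x0, x1, position, theta):
    """Rotate one 2-D pair of a query or key by the angle position * theta."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


# Compare a commonly used base with a larger one (values chosen for illustration).
small = rope_wavelengths(128, 10000.0)
large = rope_wavelengths(128, 500000.0)
```

The fastest-rotating pair has the same wavelength for any base, while the slowest pair's wavelength grows with the base, which is why base adjustment is the usual lever for long-context variants.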
-
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Authors:
Xin Men,
Mingyu Xu,
Qingyu Zhang,
Bingning Wang,
Hongyu Lin,
Yaojie Lu,
Xianpei Han,
Weipeng Chen
Abstract:
As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.
Submitted 7 March, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
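The abstract above names the Block Influence (BI) metric without defining it. The sketch below assumes BI is one minus the average cosine similarity between a layer's input and output hidden states, so a layer that barely changes its input scores near zero; this definition is an assumption here, and the function names are hypothetical. Layer removal is then illustrated as dropping the layers with the lowest scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block_influence(inputs, outputs):
    """Assumed BI: 1 - mean cosine similarity between the hidden states
    entering a layer (inputs) and leaving it (outputs), token by token."""
    sims = [cosine(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def layers_to_remove(scores, n):
    """Indices of the n layers with the lowest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n])


# Illustrative scores for a 4-layer model (made-up numbers).
scores = [0.5, 0.01, 0.3, 0.02]
removed = layers_to_remove(scores, 2)
```

A layer whose output is a rescaled copy of its input gets a score of zero under this definition, while an orthogonal output gets a score of one.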
-
Baichuan 2: Open Large-scale Language Models
Authors:
Aiyuan Yang,
Bin Xiao,
Bingning Wang,
Borong Zhang,
Ce Bian,
Chao Yin,
Chenxu Lv,
Da Pan,
Dian Wang,
Dong Yan,
Fan Yang,
Fei Deng,
Feng Wang,
Feng Liu,
Guangwei Ai,
Guosheng Dong,
Haizhou Zhao,
Hang Xu,
Haoze Sun,
Hongda Zhang,
Hui Liu,
Jiaming Ji,
Jian Xie,
JunTao Dai,
Kun Fang
, et al. (30 additional authors not shown)
Abstract:
Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.
Submitted 20 September, 2023; v1 submitted 19 September, 2023;
originally announced September 2023.