Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
Email: {jacn,petersk}@imada.sdu.dk

BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks

Jacob Nielsen (ORCID 0009-0009-8141-630X)
Peter Schneider-Kamp (ORCID 0000-0003-4000-5570)

Abstract

Recently proposed methods for 1-bit and 1.58-bit quantization-aware training investigate the performance and behavior of these methods in the context of large language models, finding state-of-the-art performance for models with more than 3B parameters. In this work, we investigate 1.58-bit quantization for small language and vision models ranging from 100K to 48M parameters. We introduce a variant of BitNet b1.58 that allows the quantization process to rely on the median rather than the mean. Through extensive experiments we investigate the performance of 1.58-bit models obtained through quantization-aware training. We further investigate the robustness of 1.58-bit quantization-aware training to changes in the learning rate and regularization through weight decay, finding different patterns for small language and vision models than previously reported for large language models. Our results show that 1.58-bit quantization-aware training provides state-of-the-art performance for small language models when doubling hidden layer sizes and reaches or even surpasses state-of-the-art performance for small vision models of identical size. Ultimately, we demonstrate that 1.58-bit quantization-aware training is a viable and promising approach also for training smaller deep learning networks, facilitating deployment of such models in low-resource use-cases and encouraging future research.

Keywords: deep learning · quantization-aware training · green machine learning · small language models · image classification

1 Introduction

Recent development in natural language processing (NLP) has been dominated by the capabilities offered by Large Language Models (LLMs). However, due to their size, these models pose a challenge in deployment and raise concerns regarding their environmental impact. Post-training quantization methods transform the 16-bit weights to a lower-bit representation, which reduces both the memory and the computational needs. The idea is to take the trained weights and find a good way of mapping them to fewer bits, enabling more efficient inference.

Several post-training quantization methods have been proposed, including but not limited to Generative Pre-trained Transformer Quantization [5] and Activation-aware Weight Quantization [9]. However, post-training quantization inherently comes at the cost of precision. Post-training quantization has also been employed in other domains such as vision models [8].

An alternative to post-training quantization is quantization-aware training such as LLM-QAT [10] and QA-LoRA. Here, as the training optimizes the quantized weights, there is no loss of precision when using the quantized model for inference. Recent works on 1-bit [13] and 1.58-bit [11] quantization-aware training architectures have demonstrated the potential of training in very low-bit representation while still maintaining most or all of the performance for LLMs.

The 1.58-bit quantization-aware training architecture BitNet b1.58 [11] proposes a solution based on replacing linear 16-bit layers with layers where the weights only assume the values $-1$, $0$, and $1$. Notably, for large-enough LLMs, BitNet b1.58 can match the 16-bit precision baselines both in capacity and performance. From above 3B parameters, the 1.58-bit models trained from scratch perform just as well as 16-bit models.

In this work we investigate 1.58-bit quantization-aware training for small language models (SLMs) and vision models ranging from 100K to 48M parameters. We introduce a variant of BitNet b1.58 that relies on the median rather than the mean of the absolute values of the weights. Through extensive experiments we investigate and compare the scaling, the learning-rate robustness, and the regularization properties of both 1.58-bit variants. Our work demonstrates that 1.58-bit quantization-aware training can get close to state-of-the-art performance on SLMs and even exceed the state-of-the-art performance on vision models, opening a new avenue for research in this direction. This facilitates the deployment of SLMs and small vision models in low-resource settings. Our implementation is available from GitHub (https://github.com/schneiderkamplab/bitlinear) and the Python Package Index (https://pypi.org/project/bitlinear/).

Figure 1: The BitLinear layer is the backbone of the BitNet 1.58 Bits Reloaded architecture. It provides a drop-in replacement for linear layers (often referred to as feed-forward networks or multi-layer perceptrons) in any architecture. *AbsMeasure* denotes the mean or median of the absolute values of the weights. The two factors $x_{scale}$ and $w_{scale}$ denote the scaling factors for the input and the 16-bit weights, respectively, used in the dequantization. We employ a straight-through estimator for the backward computations of the gradients.

2 Method

In this section we present our quantization-aware training architecture as a generalization of the BitNet b1.58 architecture [11]. First, we present our quantization method. Then, we document our experimental setup.

2.1 b1.58 Quantization

Our BitLinear layer functions as a drop-in replacement for PyTorch’s torch.nn.Linear layer. Figure 1 illustrates BitLinear’s 5-step computation flow:

1. The activations are normalized.
2. The normalized activations are quantized to k-bit precision.
3. The 16-bit shadow weights are quantized to 1.58-bit weights.
4. The quantized activations are multiplied with the 1.58-bit weights.
5. The result of the multiplication is dequantized by rescaling.

In the following, we detail the mathematics behind this computation flow. We denote the layer normalization [4] of the input $I$ as $\hat{I}$. We then define the scaling factor $x_{scale}$ for the activations, constituting the AbsMax:

x_{scale}=\frac{Q_{b}}{max(|\hat{I}|)+\epsilon}

(1)

where $Q_{b}=2^{k-1}$ is the range of the k bits used for the quantized activation and $\epsilon$ is a small value preventing division by zero. This means all activations can be scaled to integer values $\{-Q_{b},\ldots,Q_{b}-1\}$. We define the AbsMax Quantization for the activations as follows:

x_{quant}=max(-Q_{b},min(Q_{b}-1,round(\hat{I}\cdot x_{scale})))

(2)
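As an illustration of Equations (1) and (2), the following is a minimal sketch in plain Python rather than the authors' implementation. The function name `absmax_quantize`, the value `eps=1e-5`, and the sample activations are assumptions chosen for the example; the input is taken to be already normalized.

```python
def absmax_quantize(activations, k=8, eps=1e-5):
    """Quantize normalized activations to k-bit integers (sketch of Eqs. (1)-(2)).

    `eps` is an assumed value; the text only says it is small.
    """
    q_b = 2 ** (k - 1)  # Q_b = 2^(k-1), i.e. 128 for k = 8
    # Eq. (1): scale so that the largest magnitude maps to roughly Q_b
    x_scale = q_b / (max(abs(a) for a in activations) + eps)
    # Eq. (2): round, then clamp into [-Q_b, Q_b - 1]
    x_quant = [max(-q_b, min(q_b - 1, round(a * x_scale))) for a in activations]
    return x_quant, x_scale


# Hypothetical normalized activations
normalized = [-1.5, -0.2, 0.0, 0.7, 1.5]
x_quant, x_scale = absmax_quantize(normalized)
print(x_quant)
```

Note that the largest positive activation is clamped to $Q_{b}-1$, since the k-bit integer range is asymmetric.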

Furthermore, we quantize the 16-bit weights $W\in\mathcal{R}^{n\times m}$ to a ternary system of integer values $\{-1,0,1\}$ as follows. We define the scaling of $W$ as:

w_{scale}=\frac{1}{Measure(|W|)+\epsilon}

(3)

where $Measure$ denotes either the mean or the median function, constituting the AbsMeasure Quantization.

We define the quantized weights $W_{quant}$ (denoted as 1.58-Bit Weights in Figure 1) as:

W_{quant}=max(-1,min(1,round(W\cdot w_{scale})))

(4)
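To make the difference between the two measures concrete, here is a small sketch of Equations (3) and (4) in plain Python. It operates on a flat list of weights; the function name, `eps=1e-5`, and the toy weights are assumptions for illustration only.

```python
from statistics import mean, median


def absmeasure_quantize(weights, measure=mean, eps=1e-5):
    """Ternarize a flat list of weights (sketch of Eqs. (3)-(4)).

    `measure` is either statistics.mean or statistics.median.
    """
    # Eq. (3): scale by the mean or median of the absolute weights
    w_scale = 1.0 / (measure([abs(w) for w in weights]) + eps)
    # Eq. (4): round, then clamp into {-1, 0, 1}
    w_quant = [max(-1, min(1, round(w * w_scale))) for w in weights]
    return w_quant, w_scale


# Hypothetical shadow weights with two large-magnitude entries
shadow = [0.015, -0.03, 0.05, -0.40, 0.90, -0.01]
q_mean, _ = absmeasure_quantize(shadow, mean)
q_median, _ = absmeasure_quantize(shadow, median)
print("mean:  ", q_mean)
print("median:", q_median)
```

In this toy example the two large weights pull the mean up, so only they survive as non-zero values under AbsMean, while AbsMedian keeps more of the smaller weights as $\pm 1$. This is a property of the chosen numbers, not a general claim about trained networks.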

Having quantized both the activations and the weights, we can apply a kernel with $x_{quant}$ and $W_{quant}$ as inputs:

y_{quant}=x_{quant}\cdot W_{quant}+b

(5)

where $b$ is an optional bias. We detach both $x_{quant}$ and $W_{quant}$ from the computation graph to achieve a straight-through estimation of the gradients. The gradients update the “shadow weights”, i.e., the 16-Bit Weights that are quantized by AbsMeasure Quantization.
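One common way to write a straight-through estimator, given here as a sketch and not necessarily the exact formulation used in the implementation, uses a stop-gradient (detach) operator $\mathrm{sg}(\cdot)$. The symbols $\tilde{W}$, $\mathrm{sg}$, $Q$, and $\mathcal{L}$ are introduced only for this illustration, with $Q$ standing for the quantizer of Equation (4):

```latex
% sg(.) is the stop-gradient (detach) operator, Q(.) the weight quantizer of Eq. (4),
% and \mathcal{L} the training loss.
\tilde{W} = W + \mathrm{sg}\bigl(Q(W) - W\bigr)
\qquad\Longrightarrow\qquad
\tilde{W} = Q(W) \ \text{(forward)},
\qquad
\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \tilde{W}} \ \text{(backward)}
```

The forward pass thus sees the quantized values, while the backward pass treats the quantizer as the identity and passes the gradient on to the shadow weights unchanged.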

Finally, we rescale the output $y$ during the Dequantization process:

y=\frac{y_{quant}}{w_{scale}\cdot x_{scale}}

(6)
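The five steps and Equations (1) to (6) can be put together in a short, self-contained sketch. This is a plain-Python illustration for a single input vector, not the released BitLinear layer: the layer normalization is simplified (no learned parameters, `eps` added to the standard deviation), the bias is omitted, and the function name, `eps=1e-5`, and the example data are assumptions.

```python
from statistics import mean, pstdev


def bitlinear_forward(x, W, measure=mean, k=8, eps=1e-5):
    """Sketch of one BitLinear forward pass for an input vector x (length n)
    and a weight matrix W given as n rows with m columns each."""
    # Step 1: simplified layer normalization (no learned scale or shift)
    mu, sigma = mean(x), pstdev(x)
    x_hat = [(v - mu) / (sigma + eps) for v in x]
    # Step 2: AbsMax quantization of the activations, Eqs. (1)-(2)
    q_b = 2 ** (k - 1)
    x_scale = q_b / (max(abs(v) for v in x_hat) + eps)
    x_quant = [max(-q_b, min(q_b - 1, round(v * x_scale))) for v in x_hat]
    # Step 3: AbsMeasure quantization of the weights, Eqs. (3)-(4)
    w_scale = 1.0 / (measure([abs(w) for row in W for w in row]) + eps)
    W_quant = [[max(-1, min(1, round(w * w_scale))) for w in row] for row in W]
    # Step 4: integer matrix product, Eq. (5) without bias
    n, m = len(W), len(W[0])
    y_quant = [sum(x_quant[i] * W_quant[i][j] for i in range(n)) for j in range(m)]
    # Step 5: dequantization by rescaling, Eq. (6)
    y = [v / (w_scale * x_scale) for v in y_quant]
    return y_quant, y


# Hypothetical input and shadow weights
x = [1.0, 2.0, 3.0, 4.0]
W = [[0.5, -0.1], [-0.4, 0.05], [0.02, 0.6], [0.3, -0.7]]
y_quant, y = bitlinear_forward(x, W)
print(y_quant, y)
```

All arithmetic in step 4 is on integers; floating-point values only reappear in the final rescaling.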

Compared to the original BitNet b1.58, there are a number of differences:

• We chose to use a standard layer normalization (LayerNorm) rather than RMS normalization, as the computational overhead is minimal and we observed slightly better performance with the standard layer norm in preliminary experiments.
• We allow the use of both the median and the mean for quantizing weights. Prior works [13, 11] solely employ the mean. We investigate the impact of this choice in Section 3.
• We actually quantize weights and activations to integer values. This means the matrix multiplications are performed between the 1.58-bit weights with integer values $\{-1,0,1\}$ and the 8-bit quantized activations with integer values $\{-128,\ldots,127\}$. This allows the development of multiplication-free kernels, as multiplication with $-1$ corresponds to the subtraction of an 8-bit integer value, multiplication with $0$ to the disregard of a value, and multiplication with $1$ to the addition of an 8-bit integer value.

This is in contrast to previous work [11], where the quantized weights have floating point values $\{\frac{-1}{w_{scale}},0,\frac{1}{w_{scale}}\}$ while quantized activations have floating point values $\{\frac{-128}{x_{scale}},\ldots,\frac{127}{x_{scale}}\}$ according to the published information about the implementation [1]. Consequently, our BitNet 1.58 Bits Reloaded architecture is more directly amenable to custom software kernels and hardware implementations.
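The multiplication-free property described above can be sketched as follows. This is an illustrative plain-Python loop, not an optimized kernel; the function name `ternary_dot` and the sample values are assumptions.

```python
def ternary_dot(x_quant, w_column):
    """Dot product of integer activations with ternary weights,
    using only additions and subtractions."""
    acc = 0
    for a, w in zip(x_quant, w_column):
        if w == 1:
            acc += a  # multiplication with 1: add the activation
        elif w == -1:
            acc -= a  # multiplication with -1: subtract the activation
        # w == 0: the activation is disregarded
    return acc


# Hypothetical 8-bit activations and one column of ternary weights
x_quant = [-128, -43, 43, 127]
w_column = [1, -1, 0, 1]
result = ternary_dot(x_quant, w_column)
reference = sum(a * w for a, w in zip(x_quant, w_column))
print(result, reference)
```

The accumulator agrees with the ordinary multiply-and-sum reference, which is the property a custom software or hardware kernel would exploit.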

2.2 Experimental setup

We conduct all experiments with standard networks in small configurations with the torch.nn.Linear layers replaced by our BitLinear layers. The Adam [6] optimizer and a batch size of 128 are employed. The number of model parameters is slightly higher in the BitLinear setting, as we have both the 1.58-bit weights and the 16-bit shadow weights. However, this fact does not change the number of trainable/optimized parameters in practice.

For SLMs, we train small Mistral-like models with 4 layers and hidden sizes of $32$, $64$, $128$, and $256$. The number of attention heads and key-value cache heads is set to the ceiling of the hidden size divided by 64, i.e., $1$ head for hidden sizes $32$ and $64$, and $2$ and $4$ heads for hidden sizes $128$ and $256$, respectively. The resulting model sizes are 6M, 12M, 24M, and 48M parameters. We use a text corpus of 135M tokens and train from scratch for 10 epochs unless otherwise noted, corresponding to a total of 1.35B tokens for each training. We trained a Byte Pair Encoding tokenizer with a vocabulary size of $8{,}000$. The experiments are conducted with the standard trainer from the Hugging Face transformers library (https://github.com/huggingface/transformers).
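The head counts and the token budget follow directly from the numbers above. The following snippet only reproduces that arithmetic; the variable names are chosen for illustration.

```python
import math

TOKENS_PER_EPOCH = 135_000_000  # corpus size stated in the text
EPOCHS = 10

# Number of attention heads: ceiling of hidden size divided by 64
heads = {hidden: math.ceil(hidden / 64) for hidden in (32, 64, 128, 256)}
total_tokens = TOKENS_PER_EPOCH * EPOCHS

print(heads)
print(f"{total_tokens / 1e9:.2f}B tokens per training run")
```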

For vision models, we consider a standard serial implementation of a classifier for MNIST and standard CNN-based implementations for CIFAR-10 and CIFAR-100. The model for MNIST is the smallest in this paper with only 100K parameters. The CIFAR-10 and CIFAR-100 models are the next-smallest models with 2.1M and 2.2M parameters, respectively. The difference in model size is explained by CIFAR-10 having 10 classes and CIFAR-100 having 100 classes. The experiments are based on PyTorch Lightning (https://github.com/Lightning-AI/pytorch-lightning) and use torchvision’s (https://pytorch.org/vision/stable/index.html) versions of the datasets.

The MNIST [2] dataset consists of 60,000 training and 10,000 test samples. The CIFAR-10 [7] and CIFAR-100 [7] datasets both contain 50,000 training and 10,000 test samples. All models are trained from scratch. We calculate the accuracy as the mean of the percentage of correct batches across the test set.

3 Results

In this section, we present a comparison of our BitLinear implementation with 16-bit floating point torch.nn.Linear layers, showing close-to-state-of-the-art performance on SLMs and better-than-state-of-the-art performance for vision models. We also perform ablation studies on the learning rate and weight decay hyperparameters, as well as the choice of mean vs. median for the quantization of the weights.

3.1 Small Language Models

The first experiment for SLMs is a scaling experiment, where we perform 16-bit and 1.58-bit training on all four model sizes. The second experiment is a hyperparameter tuning for the learning rate and weight decay in a 12M SLM, with a fixed hidden size of 64. We show the results of the first and second experiment in Tables 3.1 and 3.1, respectively.

Both tables show the different configurations and perplexities after 10 epochs. For most configurations, the training has converged or is close to convergence at the end of the experiment. It is important to keep in mind that the reported perplexity is the exponentiation of the entropy, i.e., here the exponentiation of the loss defined via cross-entropy. Thus, minor changes in the loss result in quite discernible changes to the perplexity.
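To illustrate how small loss differences translate into visible perplexity differences, the sketch below converts two perplexities back into cross-entropy losses. The two values are the 16-bit results for hidden size 64 from the scaling table; the helper names are assumptions, and the loss is assumed to be measured in nats.

```python
import math


def perplexity(loss):
    """Perplexity is the exponentiation of the cross-entropy loss."""
    return math.exp(loss)


def loss_from_perplexity(ppl):
    """Inverse of perplexity(): recover the loss from a perplexity."""
    return math.log(ppl)


# 16-bit runs with hidden size 64, without and with weight decay
ppl_a, ppl_b = 36.7, 37.5
loss_gap = loss_from_perplexity(ppl_b) - loss_from_perplexity(ppl_a)
print(f"perplexity gap: {ppl_b - ppl_a:.1f}, loss gap: {loss_gap:.4f}")
```

A perplexity difference of 0.8 here corresponds to a loss difference of only about 0.02.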

The first two columns give the hidden layer size and the number of parameters. The third column provides the bit-depth and implementation: “16” stands for 16-bit training, “1.58-mean” for our BitLinear implementation with 1.58 bits and AbsMean quantization of weights, and “1.58-median” for our BitLinear implementation with 1.58 bits and AbsMedian quantization of weights.

We show the results of this first experiment in Figure 2(d). Figure 2(a) shows that the 16-bit training scales exactly as expected when the hidden layer size, and thus the model's capacity, increases. We see in Figure 2(b) that 1.58-bit training follows the same trend, albeit with slightly lower performance. In Figure 2(c), we can visually compare the scaling between 16-bit training for models with hidden sizes 32 and 64 and 1.58-bit training for models with hidden sizes 64 and 128. The observed perplexities suggest that the effective capacities of the models with 1.58-bit weights are around half that of the models with 16-bit weights, i.e., that hidden layers of approximately double size are needed for 1.58-bit models to reach performance comparable with the 16-bit counterparts. Figure 2(d) shows that the median generally converges more slowly than the mean over the employed weight decays. We discuss this fact in Section 4.
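The "double the hidden size" comparison can also be read off the scaling table directly. The snippet below pairs the best 16-bit perplexity at each hidden size with the best 1.58-bit perplexity (mean or median variant) at twice that size; the numbers are copied from the table, and the dictionary names are chosen for illustration.

```python
# Best (lowest) perplexity per hidden size, copied from the scaling table
best_16bit = {32: 77.8, 64: 36.7, 128: 21.4, 256: 16.6}
best_158bit = {32: 116.6, 64: 60.0, 128: 36.3, 256: 26.8}

# Pair each 16-bit model with the 1.58-bit model of double hidden size
pairs = [
    (hidden, ppl16, 2 * hidden, best_158bit[2 * hidden])
    for hidden, ppl16 in best_16bit.items()
    if 2 * hidden in best_158bit
]

for hidden, ppl16, doubled, ppl158 in pairs:
    print(f"16-bit @ {hidden:3d}: {ppl16:5.1f}   1.58-bit @ {doubled:3d}: {ppl158:5.1f}")
```

In these numbers, the doubled 1.58-bit model is ahead of the 16-bit model at hidden size 32, roughly on par at 64, and behind at 128.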

The fourth column shows the learning rate. For 16-bit training, we took a high but stable learning rate of 0.001 (1e-3). For 1.58-bit training, we used the same or a larger learning rate of 0.01 (1e-2), as 1.58-bit training has been found to be more robust to higher learning rates in the context of LLMs [11]. The fifth column shows the weight decay. We tried both a small but noticeable decay of 5%, which is quite common in the pre-training and fine-tuning of LLMs, and no weight decay. Both Tables 3.1 and 3.1 hint that a weight decay of 5% yields the best or similar performance compared to other values of weight decay. This is also visualized in Figures 3(a), 3(b), and 3(c), where trainings with a weight decay of 5% are represented by a red line.

The sixth column provides the perplexity after 10 epochs. For nearly all configurations, after 10 epochs the training had converged. The best perplexities for 16-bit and 1.58-bit training are marked in bold, respectively. The seventh and last column shows the number of epochs, with total training length corresponding to 135M tokens per epoch, i.e., 1.35B tokens per 10-epoch experiment.


#Hidden	#Params	Bits	Learning Rate	Weight Decay	Perplexity	Epochs
32	6M	16	0.001	0.00	77.8	10
		16	0.001	0.05	81.0	10
		1.58-mean	0.001	0.00	166.2	10
			0.001	0.05	164.9	10
			0.01	0.00	130.1	10
			0.01	0.05	134.4	10
		1.58-median	0.001	0.00	183.6	10
			0.001	0.05	183.8	10
			0.01	0.00	116.6	10
			0.01	0.05	118.0	10
64	12M	16	0.001	0.00	36.7	10
		16	0.001	0.05	37.5	10
		1.58-mean	0.001	0.00	67.4	10
			0.001	0.05	68.2	10
			0.01	0.00	76.3	10
			0.01	0.05	68.2	10
		1.58-median	0.001	0.00	76.5	10
			0.001	0.05	68.1	10
			0.01	0.00	61.1	10
			0.01	0.05	60.0	10
128	24M	16	0.001	0.00	22.3	10
		16	0.001	0.05	21.4	10
		1.58-mean	0.001	0.00	36.8	10
			0.001	0.05	36.3	10
			0.01	0.00	61.6	10
			0.01	0.05	71.0	10
		1.58-median	0.001	0.00	39.8	10
			0.001	0.05	37.5	10
			0.01	0.00	42.3	10
			0.01	0.05	38.4	10
256	48M	16	0.001	0.00	16.6	10
		16	0.001	0.05	16.7	10
		1.58-mean	0.001	0.00	28.7	10
			0.001	0.05	27.1	10
			0.01	0.00	77.7	10
			0.01	0.05	65.6	10
		1.58-median	0.001	0.00	26.8	10
			0.001	0.05	27.5	10
			0.01	0.00	65.1	10
			0.01	0.05	63.8	10

Dataset	#Params	Bits	Learning Rate	Weight Decay	Test Accuracy	Epochs
MNIST	100K	16	0.0001	0.00	92.29	10
			0.001	0.00	96.93	10
				0.01	93.35	10
				0.05	77.06	10
		1.58-mean	0.0001	0.00	95.63	10
			0.0001	0.01	96.08	10
			0.001	0.00	96.01	10
				0.01	93.11	10
				0.05	86.57	10
			0.01	0.00	94.59	10
			0.05	0.00	93.80	10
			0.10	0.00	93.15	10
		1.58-median	0.0001	0.00	94.14	10
			0.0001	0.01	95.93	10
			0.001	0.00	95.80	10
				0.01	91.27	10
				0.05	89.15	10
			0.01	0.00	93.03	10
			0.05	0.00	86.61	10
			0.10	0.00	52.35	10
CIFAR10	2.1M	16	0.0001	0.00	60.86	10
			0.001	0.00	70.06	10
				0.01	58.32	10
				0.05	10.0	10
				0.10	10.0	10
		1.58-mean	0.0001	0.00	68.94	10
				0.01	69.1	10
				0.05	71.47	10
			0.001	0.00	70.35	10
				0.01	69.08	10
				0.05	58.04	10
			0.01	0.00	63.92	10
			0.05	0.00	25.01	10
			0.10	0.00	23.05	10
		1.58-median	0.0001	0.00	69.08	10
				0.01	69.55	10
				0.05	70.25	10
			0.001	0.00	71.21	10
				0.01	69.80	10
				0.05	60.61	10
			0.01	0.00	65.80	10
			0.05	0.00	54.77	10
			0.10	0.00	49.48	10