Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators

P. D’​Alberto    T. Jeong    A. Jain    S. Manjunath    M. Sarmah    S. Hsu    Y. Raparti    N. Pipralia
Abstract

Nowadays, increasingly larger Deep Neural Networks (DNNs) are being developed, trained, and utilized. These networks require significant computational resources, putting a strain on both advanced and limited devices. Our solution is to implement weight block sparsity, which is a structured sparsity that is friendly to hardware. By zeroing certain sections of the convolution and fully connected layers parameters of pre-trained DNN models, we can efficiently speed up the DNN’s inference process. This results in a smaller memory footprint, faster communication, and fewer operations.

Our work presents a vertical system that allows for the training of convolution and matrix multiplication weights to exploit 8x8 block sparsity on a single GPU within a reasonable amount of time. Compilers recognize this sparsity and use it for both data compaction and computation splitting into threads. Blocks like these take full advantage of both spatial and temporal locality, paving the way for fast vector operations and memory reuse. By using this system on a Resnet50 model, we were able to reduce the weight by half with minimal accuracy loss, resulting in a two-times faster inference speed. We will present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with Resnet50, Inception V3, and VGG16 to demonstrate the necessary synergy between hardware overlay designs and software stacks for compiling and executing machine learning applications.

I Introduction

To reduce computational cost of a large-scale DNNs, pruning and quantization have been widely used [17], which reduce the complexity of neural networks by removing output filters all together, part of them, or simplifying type and number of the basic operations. Weight Block sparsity is another approach to speedup of DNN’s inference, orthogonal to pruning or quantization, and it can be used together with them.

Sparsity means that value of a subset of weights is exactly zero. If a value of a weight is zero, then the linear operation associated with the value can be skipped, since any value times zero equals zero (obviously). Therefore, the computational cost of sparse linear operations should be only proportional to the number of non-zero weights [12]. The main problem of operations using weights with arbitrary sparsity (i.e., also known as unstructured) is that they cannot be efficiently implemented on contemporary hardware, such as CPUs, GPUs, and NPUs. That is, they cannot use effectively vector operations. However, highly optimized block-sparse operations, with block sizes as small as 8×8888\times 88 × 8, can run efficiently on AIE2 processors, an AI hardware accelerator. Figure 1 gives a visual comparison of block-sparse weights with 8×8888\times 88 × 8 block and dense weights.

Refer to caption
(a)
Refer to caption
(b)
Figure 1: Visualization of dense and block-sparse weight matrix, zero blocks are green without variations

Consider a convolution 𝐘=𝐖𝐗+b𝐘𝐖𝐗𝑏{\bf Y}={\bf W}*{\bf X}+bbold_Y = bold_W ∗ bold_X + italic_b: The output 𝐘𝐘{\bf Y}bold_Y is a tensor of size M×N×O𝑀𝑁𝑂M\times N\times Oitalic_M × italic_N × italic_O, the input 𝐗𝐗{\bf X}bold_X is of size L×J×I𝐿𝐽𝐼L\times J\times Iitalic_L × italic_J × italic_I, and the kernel 𝐖𝐖{\bf W}bold_W is of size O×H×K×I𝑂𝐻𝐾𝐼O\times H\times K\times Iitalic_O × italic_H × italic_K × italic_I (the bias b𝑏bitalic_b is a vector of size O𝑂Oitalic_O). We abstract the kernel dimensions one level higher into a matrix 𝐖¯i,j=𝐖i,,,jsubscript¯𝐖𝑖𝑗subscript𝐖𝑖𝑗{\bar{\bf W}}_{i,j}={\bf W}_{i,*,*,j}over¯ start_ARG bold_W end_ARG start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = bold_W start_POSTSUBSCRIPT italic_i , ∗ , ∗ , italic_j end_POSTSUBSCRIPT of size O×I𝑂𝐼O\times Iitalic_O × italic_I where we hide the kernel dimensions H×K𝐻𝐾H\times Kitalic_H × italic_K (think them as going inside the paper, Figure 2). We describe the granularity of a block by a pair bo×bi[8×8,4×16,16×4]subscript𝑏𝑜subscript𝑏𝑖88416164b_{o}\times b_{i}\in[8\times 8,4\times 16,16\times 4]italic_b start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT × italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ [ 8 × 8 , 4 × 16 , 16 × 4 ]. This particular granularity set is based on AIE2, but our approach can be extended to any computational units. Of course, the choice of granularity makes a difference in execution time in more than one way. The block sparsity is a process of introducing zeros into rectangular blocks in 𝐖¯¯𝐖{\bar{\bf W}}over¯ start_ARG bold_W end_ARG and we identify such a process as

Γ(𝐖¯,bo×bi),ι=0 s.t. 𝐖i,,,j=0 with i[bo,(+1)bo1],j[ιbi,(ι+1)bi1]formulae-sequenceΓsubscript¯𝐖subscript𝑏𝑜subscript𝑏𝑖𝜄0 s.t. subscript𝐖𝑖𝑗0 with 𝑖subscript𝑏𝑜1subscript𝑏𝑜1𝑗𝜄subscript𝑏𝑖𝜄1subscript𝑏𝑖1\begin{split}\Gamma({\bar{\bf W}},b_{o}\times b_{i})_{\ell,\iota}&=0\\ &\mbox{ s.t. }{\bf W}_{i,*,*,j}=0\mbox{ with }\\ &i\in[\ell*b_{o},(\ell+1)*b_{o}-1],\\ &j\in[\iota*b_{i},(\iota+1)*b_{i}-1]\\ \end{split}start_ROW start_CELL roman_Γ ( over¯ start_ARG bold_W end_ARG , italic_b start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT × italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT roman_ℓ , italic_ι end_POSTSUBSCRIPT end_CELL start_CELL = 0 end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL s.t. bold_W start_POSTSUBSCRIPT italic_i , ∗ , ∗ , italic_j end_POSTSUBSCRIPT = 0 with end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_i ∈ [ roman_ℓ ∗ italic_b start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT , ( roman_ℓ + 1 ) ∗ italic_b start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT - 1 ] , end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_j ∈ [ italic_ι ∗ italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , ( italic_ι + 1 ) ∗ italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 ] end_CELL end_ROW (1)

The block Γ(𝐖¯,4×16)1,2Γsubscript¯𝐖41612\Gamma({\bar{\bf W}},4\times 16)_{1,2}roman_Γ ( over¯ start_ARG bold_W end_ARG , 4 × 16 ) start_POSTSUBSCRIPT 1 , 2 end_POSTSUBSCRIPT is zero means 𝐖i,,,jsubscript𝐖𝑖𝑗{\bf W}_{i,*,*,j}bold_W start_POSTSUBSCRIPT italic_i , ∗ , ∗ , italic_j end_POSTSUBSCRIPT is zero with i[14,241]𝑖14241i\in[1*4,2*4-1]italic_i ∈ [ 1 ∗ 4 , 2 ∗ 4 - 1 ] and j[216,3161]𝑗2163161j\in[2*16,3*16-1]italic_j ∈ [ 2 ∗ 16 , 3 ∗ 16 - 1 ]. In practice, ΓΓ\Gammaroman_Γ is a bit map or a mask (i.e., 0 zeros, 1 unaffected but it could be used as weighted mask with values in [0,1]01[0,1][ 0 , 1 ]). We can express block sparsity percentage as the relative number of zeros in ΓΓ\Gammaroman_Γ and in particular the number of zeros in every row of ΓΓ\Gammaroman_Γ, where N𝑁Nitalic_N is the number of columns in ΓΓ\Gammaroman_Γ and for every row of ΓΓ\Gammaroman_Γ:

𝒮=1N(NiNΓ(𝐖¯,O×I)o,i)𝒮1𝑁𝑁superscriptsubscript𝑖𝑁Γsubscript¯𝐖𝑂𝐼𝑜𝑖{\cal S}=\frac{1}{N}(N-\sum_{i}^{N}\Gamma({\bar{\bf W}},O\times I)_{o,i})caligraphic_S = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ( italic_N - ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_Γ ( over¯ start_ARG bold_W end_ARG , italic_O × italic_I ) start_POSTSUBSCRIPT italic_o , italic_i end_POSTSUBSCRIPT ) (2)

In Figure 2, we show an example where we zero one block per row and the block granularity is 8×8888\times 88 × 8.

Refer to caption

Figure 2: Example of block sparsity Γ(𝐖¯,8×8)Γ¯𝐖88\Gamma({\bar{\bf W}},8\times 8)roman_Γ ( over¯ start_ARG bold_W end_ARG , 8 × 8 ), 𝐖¯¯𝐖{\bar{\bf W}}over¯ start_ARG bold_W end_ARG, and 𝐖𝐖{\bf W}bold_W

.

The main advantages of block sparsity are:

  1. 1.

    The non-zero blocks are dense in the inner dimensions (input channels) and we can tune the granularity. We exploit spatial locality by reading in one cycle 4 elements and we read a multiple contiguous in memory. This is spatial locality and it must be exploited by long-row memories and long-line caches. In turn, it is highly used by vectorizable functional units that consume and produce multiple results in a single clock.

  2. 2.

    The block are dense in output dimension so the same weight volume can be reused for multiple outputs; that is, better throughput and possibly latency.

  3. 3.

    The zeros blocks are easily skipped: 50% sparsity results in 50% fewer operations with the identical throughput and latency.

  4. 4.

    The computation is naturally split by block of filters, exploiting (output) parallelism, and each sub-computation has a known and countable number of operations per row and per block.

Block Sparsity would be beneficial for any hardware architecture with vectorizable instructions (spatial locality and parallelism) and multiple computation units (thread parallelism). We exploit block sparsity in terms of training neural network, its compilation, and implementation on a certain hardware. The main contributions of this work are:

  1. 1.

    We proposed Weight Block Sparsity method on a pre-trained neural network, which is hardware-friendly structured method, and can be applied any CPU, GPU, and NPU (Neural processing Unit) and reduce the memory footprint of a deep neural network and accelerate its inference.

  2. 2.

    Also, we proposed the compiler and code generation for the neural network with Block Sparsity. Our Compiler is based on parameterized representation of block sparsity, allows the capability to split tensors and computations accordingly to a parameterized representation of the architecture.

  3. 3.

    Lastly, we could estimate the time execution for operations of a neural network including DDR to mentile communications and graph computation. Sparsity reduces computation and communication time, and is captured using the time estimation.

II Related Works

The modern reference of sparsity in deep learning is by Hoefler et al [13], which is an entire spectrum of sparsity and is not limited to block sparsity. The original sparsity comes from applications, for example [23, 8, 24]. Data structures and algorithms are tailored to exploit sparsity. Often, the data structure is tailored to the sparsity in order to compact data, different sparsity structures require different algorithms, and achieve different performance. Block sparsity in this context is for matrices (i.e., often diagonal matrices) where a block is stored contiguously in memory. Siswanto in [25] introduces block sparsity in convolutional network generalizing for matrices (i.e., fully connected layers) but the application to convolution layers is different than ours (the author does pruning of the kernel by height or width). Our block sparsity describes logically a block but not in memory layout, in most framework the input channel is the innermost dimensions. The zeros are shuffled in, as for matrices, we shall compact the blocks.

The works by Numenta in [1, 16] go into two directions: block sparsity of the convolution kernels and of activation tensors; that is, input and output. For the kernel, they can choose any granularity and every filter can be sparsified so that a jig-saw puzzle can be created: each filter is block sparse with different patterns so that, across filters, they can further compact the data structure. The sparsification of the activation tensor does not have a pattern; it is called unstructured sparsity. The structured sparsity can be achieved only by a carefully process that must be part of the network training. It is not clear from the literature how the training is carried on.

Nvidia implemented 50% fine-grained sparsity on Ampere architecture [21]. Nvidia A100 supports convolution and matrix multiply operations in 2:4 pattern, 2 values are zero out of each contiguous value of 4. The sparsity network achieved 1.3x to 1.5x in speedup. The same functionality is available in AMD GPUs and (future) AIE2s instruction sets. This sparsity is appealing because of its hardware support, deterministic space footprint, and computation reduction. However, it imposes very stringent constraints on the type of sparsity not always allowing a comparable accuracy.

The difficulty of generating code for complex architectures can be described and solved in different ways. There are recent attempts of introducing Software/Hardware solution for spatial processors [15, 22, 4]. However, their major attentions are given only to matrix multiplication and GPUs [11] [19]. In our case, we are focusing on the so called static sparsity or weight sparsity.

III Block Sparsity in a Matrix

Block sparsity is an intuitive concept but it can be misunderstood. Take a matrix multiplication in Equation 3

(C0C1C2C3)=(A0A1A2A3)(0B1B20)matrixsubscript𝐶0subscript𝐶1subscript𝐶2subscript𝐶3matrixsubscript𝐴0subscript𝐴1subscript𝐴2subscript𝐴3matrix0subscript𝐵1subscript𝐵20\begin{pmatrix}C_{0}&C_{1}\\ C_{2}&C_{3}\end{pmatrix}=\begin{pmatrix}A_{0}&A_{1}\\ A_{2}&A_{3}\\ \end{pmatrix}\\ \begin{pmatrix}0&B_{1}\\ B_{2}&0\\ \end{pmatrix}\\ ( start_ARG start_ROW start_CELL italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL start_CELL italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL italic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) = ( start_ARG start_ROW start_CELL italic_A start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL start_CELL italic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL italic_A start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) ( start_ARG start_ROW start_CELL 0 end_CELL start_CELL italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_B start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL 0 end_CELL end_ROW end_ARG ) (3)

This is the computation

C0=A1B2;C1=A0B1;C2=A3B2;C3=A2B1formulae-sequencesubscript𝐶0subscript𝐴1subscript𝐵2formulae-sequencesubscript𝐶1subscript𝐴0subscript𝐵1formulae-sequencesubscript𝐶2subscript𝐴3subscript𝐵2subscript𝐶3subscript𝐴2subscript𝐵1C_{0}=A_{1}B_{2};\;C_{1}=A_{0}B_{1};\;C_{2}=A_{3}B_{2};\;C_{3}=A_{2}B_{1}italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_B start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ; italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; italic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT italic_B start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ; italic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (4)

and in general with proper γisubscript𝛾𝑖\gamma_{i}italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (i.e., a mask)

Ci=k=01Ai+k(γ2k+iB2k+i)subscript𝐶𝑖superscriptsubscript𝑘01subscript𝐴𝑖𝑘subscript𝛾2𝑘𝑖subscript𝐵2𝑘𝑖C_{i}=\sum_{k=0}^{1}A_{i+k}\big{(}\gamma_{2*k+i}B_{2*k+i}\big{)}italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_i + italic_k end_POSTSUBSCRIPT ( italic_γ start_POSTSUBSCRIPT 2 ∗ italic_k + italic_i end_POSTSUBSCRIPT italic_B start_POSTSUBSCRIPT 2 ∗ italic_k + italic_i end_POSTSUBSCRIPT ) (5)

Where the matrix B𝐵Bitalic_B is constant, diagonal, and each submatrix B2subscript𝐵2B_{2}italic_B start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT and B1subscript𝐵1B_{1}italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT can split further down and may have even smaller zero blocks. For data structure in the sparse computation field, we can use compress block row (CBR) or column format (CBC) or generalization of the coordinate format (COO). There are standard matrix sparse-matrix multiplication interfaces and algorithms for CPU and GPUs using this data format (where only one operand is sparse or both) [2, 20]. In the introduction section, we describe block sparsity applied to convolutions. We store a compact ΓΓ\Gammaroman_Γ for inference, this could be literally a bitmap or an block row/column index. For the effective computation by the AIE kernel, we deploy a relative row/col index that will apply to dense operations as well. The figurative sparsity is now expressed in a computation where the savings are clear.

This is briefly our overall process and contributions. We explore training techniques using PyTorch and Keras in Section IV. We compute a mask ΓΓ\Gammaroman_Γ of zeros/ones per layer by zeroing the more likely blocks (using a Norm). Then we train the model till convergence or accuracy are achieved. We take the sparse model and we quantize to 8-bit integer computations with a quantizer which is based on round-to-nearest quantization (Equation 12). The final model is an intermediate representation. We have a custom compiler that takes the intermediate representation and an abstraction of a connected set of AIE2 (i.e., Section V-A and V). The compiler computes the maximum sub-volume computation per core. By heuristics and following a schedule, it computes a memory allocation in Memtile (i.e., intermediate scratch pad Section V-C) for inputs, outputs, and weights . It formats, compresses, and organizes the weights exploiting spatial distribution to Memtiles and cores. We generate all the explicit communications between DDR (L3subscript𝐿3L_{3}italic_L start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT), Memtile (L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT), and cores (L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT). These are Gather and Scatter instructions with a complete and parametric estimate of their execution time by bandwidth constraints and number of channels. The calculation is based on the sub-volume sizes per core, the computation throughput and with a clear specification of what is executed in parallel. The data movement codes and AIE core codes for simple convolutions were simulated and run in hardware for simple layers (i.e., Section V-D). Thus, we can achieve consistent execution time estimates per layer and of the entire network with an accuracy closer to a simulation (realistic although optimistic Section V-E). We will show estimates for three CNN models and eight different AIE designs; see Section VII. To our knowledge, we are the first in applying sparsity on AIEs systematically. The ability to provide different algorithms and easy to obtain estimates for very different configurations, it will allow to explore optimizations like sub-graph depth-wise tiling we could not have otherwise.

In our context, convolution is our main computation and CNN are networks we can train reasonably. This is because of legacy, we want to speed up the FPGA work-horses, convolutions provide more difficulties than GEMMs (padding, strides, and overlap), have usually biases with different precision requirements (weight 8bits and bias 32), routinely they can deploy different scaling factors per output channels, and GEMM is transformed into a 1×1111\times 11 × 1 convolution immediately. The other way around is possible with proper activation preparation; in fact, a convolution can be represented as a GEMM but the activation tensors explode in space by a factor of the kernel size and the preparation is more suitable for CPUs than FPGAs.

III-A Block-Sparse Matrix-Matrix Multiplication

Consider ΓΓ\Gammaroman_Γ and ΩΩ\Omegaroman_Ω two matrices in {0,1}N×Nsuperscript01𝑁𝑁\{0,1\}^{N\times N}{ 0 , 1 } start_POSTSUPERSCRIPT italic_N × italic_N end_POSTSUPERSCRIPT.

C=(ΓA)(ΩB)t𝐶Γ𝐴superscriptΩ𝐵𝑡C=(\Gamma A)*(\Omega B)^{t}italic_C = ( roman_Γ italic_A ) ∗ ( roman_Ω italic_B ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT (6)

More precisely, consider non-zero blocks of size k×k𝑘𝑘k\times kitalic_k × italic_k so that

CiN+j=k(γiN+kAiN+k)(ω˙jN+kB˙jN+k)subscript𝐶𝑖𝑁𝑗subscript𝑘subscript𝛾𝑖𝑁𝑘subscript𝐴𝑖𝑁𝑘subscript˙𝜔𝑗𝑁𝑘subscript˙𝐵𝑗𝑁𝑘C_{i*N+j}=\sum_{k}(\gamma_{i*N+k}A_{i*N+k})(\dot{\omega}_{j*N+k}\dot{B}_{j*N+k})italic_C start_POSTSUBSCRIPT italic_i ∗ italic_N + italic_j end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_γ start_POSTSUBSCRIPT italic_i ∗ italic_N + italic_k end_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_i ∗ italic_N + italic_k end_POSTSUBSCRIPT ) ( over˙ start_ARG italic_ω end_ARG start_POSTSUBSCRIPT italic_j ∗ italic_N + italic_k end_POSTSUBSCRIPT over˙ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_j ∗ italic_N + italic_k end_POSTSUBSCRIPT ) (7)

Thanks to the sparsity and if we store only non-zeros, then γiN+ksubscript𝛾𝑖𝑁𝑘\gamma_{i*N+k}italic_γ start_POSTSUBSCRIPT italic_i ∗ italic_N + italic_k end_POSTSUBSCRIPT and ω˙jN+ksubscript˙𝜔𝑗𝑁𝑘\dot{\omega}_{j*N+k}over˙ start_ARG italic_ω end_ARG start_POSTSUBSCRIPT italic_j ∗ italic_N + italic_k end_POSTSUBSCRIPT are contiguous. There will be a meaningful product to compute if and only if γiN+k=1subscript𝛾𝑖𝑁𝑘1\gamma_{i*N+k}=1italic_γ start_POSTSUBSCRIPT italic_i ∗ italic_N + italic_k end_POSTSUBSCRIPT = 1 and ω˙jN+k=1subscript˙𝜔𝑗𝑁𝑘1\dot{\omega}_{j*N+k}=1over˙ start_ARG italic_ω end_ARG start_POSTSUBSCRIPT italic_j ∗ italic_N + italic_k end_POSTSUBSCRIPT = 1. We merge-sort these vectors. See how the sparse sparse matrix multiplication using Coordinate Block Structure (COO) is applied in Figure 3. We provide software to reproduce this [7].

In the case of achieving a fixed sparsity (i.e., density) of 50% for a square matrix of size N𝑁Nitalic_N and we choose the block size k×k𝑘𝑘k\times kitalic_k × italic_k. The larger k𝑘kitalic_k is, the smaller the overhead will be. The relative performance of the k3superscript𝑘3k^{3}italic_k start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT multiplication is better as k𝑘kitalic_k get larger because spatial, temporal locality, and further optimized code for a constant/parameterized such as k𝑘kitalic_k.

Refer to caption
(a)
Refer to caption
(b)
Figure 3: Block 1x1 and 8x8 performance

In Figure 3, we present two scatter plots: on the abscissa the effective multiplication-and-addition number, on the ordinate the performance in GFLOPS, when the sparse matrix with dense block is 1×1111\times 11 × 1 (above) and 8×8888\times 88 × 8 (below). Given the same task, we deploy more threads and thus the vertical effect (AMD 16 cores Threadripper). With the same number of effective operations, the block permits and exploits higher GFLOPS per effective operation (Float is 2x faster than Double precision and this can be emphasized further [11, 19] and [18], for AIE2 we can do 256 multiply accumulate per cycle using 8 bit operands, 128 per 16 bits, 64 per 32bits).

IV Block Sparsity: Training and Quantization

The block sparsity we plan to deploy is not naturally recurring in Convolutional Neural Networks. Training is required and achievable practically as fine tuning of trained networks.

As reminder, a convolution has a weight tensor in four dimension: WRcout×h×k×cin𝑊superscript𝑅subscript𝑐𝑜𝑢𝑡𝑘subscript𝑐𝑖𝑛W\in R^{c_{out}\times h\times k\times c_{in}}italic_W ∈ italic_R start_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT × italic_h × italic_k × italic_c start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. In the hyperplane of the hhitalic_h and k𝑘kitalic_k, we can simplify the weight as W˙Rcout×cin˙𝑊superscript𝑅subscript𝑐𝑜𝑢𝑡subscript𝑐𝑖𝑛\dot{W}\in R^{c_{out}\times c_{in}}over˙ start_ARG italic_W end_ARG ∈ italic_R start_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT × italic_c start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT and block sparsity can be simply described by a mask ΓW˙Γ˙𝑊\Gamma\dot{W}roman_Γ over˙ start_ARG italic_W end_ARG. Although, we speak of a 8×8888\times 88 × 8 of non-zeros, this is in practice a 8×h×k×88𝑘88\times h\times k\times 88 × italic_h × italic_k × 8 block. For the matrix multiply h=k=1𝑘1h=k=1italic_h = italic_k = 1, there is no difference from the unstructured sparsity. We explain the training process.

IV-A Searching Optimum Sparsity ratio

We target convolutions first and without quantization. We take a model and we create a copy where we enhance the convolution with a (non-trainable) ΓΓ\Gammaroman_Γ. ΓΓ\Gammaroman_Γ is the sparsity ratio and mask. This experiment is conducted using Keras[5] and the code is available [6]. A convolution will have three parameters (saving the model into a different format). The forward computation is modified so that the weights used for convolution are ΓWΓ𝑊\Gamma{W}roman_Γ italic_W. We assume the backward computation (i.e., gradient) is done automatically from the forward definition. There is no need to change the bias. For example, we take Resnet50 from the Keras application repository, we start with a Γ=1Γ1\Gamma=1roman_Γ = 1, and we trained one epoch using Imagenet dataset [9]. The goal is to choose ΓΓ\Gammaroman_Γ in such a way we achieve the required sparsity and the minimum loss in accuracy. We start from W0subscript𝑊0W_{0}italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, the current optimal weight, then choose values in W0subscript𝑊0W_{0}italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to make zeros using different approaches such as incremental, Fisher measure, Hessian, diagonal Hessian, and custom penalty losses.

IV-B Sparsity Ratio as Incremental

For example, for every convolution, ni=γisubscript𝑛𝑖subscript𝛾𝑖n_{i}=\sum\gamma_{i}italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, which is the number of blocks, reduce this number to ni2subscript𝑛𝑖2\frac{n_{i}}{2}divide start_ARG italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG. For each epoch (say every two training epochs), we consider the current non-set-yet mask elements (1γi)=k<ni21subscript𝛾𝑖𝑘subscript𝑛𝑖2\sum(1-\gamma_{i})=k<\frac{n_{i}}{2}∑ ( 1 - italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_k < divide start_ARG italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG. We compute our importance measure for all in ascending order. This time, we zero the first min(5100k,1)5100𝑘1\min(\frac{5}{100}k,1)roman_min ( divide start_ARG 5 end_ARG start_ARG 100 end_ARG italic_k , 1 ). We keep this process until we reach 50% sparsity. At each iteration at least one block will be set to zero. We trace a solution path as much as geodesic as possible.

IV-C Sparsity Ratio as Trainable as Optimization Problem

If we want to make ΓΓ\Gammaroman_Γ part of the optimization process as trainable variable we could introduce a penalty function into the loss (x,w)+λ(w)𝑥𝑤𝜆𝑤{\bf\ell}(x,w)+\lambda(w)roman_ℓ ( italic_x , italic_w ) + italic_λ ( italic_w ). First let us introduce an approximation for the max(x)𝑥\max(x)roman_max ( italic_x ), so when in this section you will read max,min\max,\minroman_max , roman_min, this is a log sum exponential, which is continuous, derivable everywhere, and convex:

max(x)=LSE(x,α)=1αlogexiα𝑥𝐿𝑆𝐸𝑥𝛼1𝛼superscript𝑒subscript𝑥𝑖𝛼\max(x)=LSE(x,\alpha)=\frac{1}{\alpha}\log\sum e^{x_{i}*\alpha}roman_max ( italic_x ) = italic_L italic_S italic_E ( italic_x , italic_α ) = divide start_ARG 1 end_ARG start_ARG italic_α end_ARG roman_log ∑ italic_e start_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∗ italic_α end_POSTSUPERSCRIPT (8)

With T𝑇Titalic_T we represent the number of non-zero block in ΓΓ\Gammaroman_Γ

λ=𝜆absent\displaystyle\lambda=italic_λ = (max(Γ)min(Γ))ΓΓ\displaystyle-(\max(\Gamma)-\min(\Gamma))- ( roman_max ( roman_Γ ) - roman_min ( roman_Γ ) ) (9)
+βL2(ΓT)+ιL1(Γ)𝛽𝐿2Γ𝑇𝜄𝐿1Γ\displaystyle+\beta*L2(\Gamma-T)+\iota*L1(\Gamma)+ italic_β ∗ italic_L 2 ( roman_Γ - italic_T ) + italic_ι ∗ italic_L 1 ( roman_Γ )

This is a simplified loss so that we minimize the value of ΓΓ\Gammaroman_Γ but also try to maximize the difference of the elements.

λ=𝜆absent\displaystyle\lambda=italic_λ = max(Γ,0)+max(Γ1,0)(max(Γ)min(Γ))Γ0Γ10ΓΓ\displaystyle\max(-\Gamma,0)+\max(\Gamma-1,0)-(\max(\Gamma)-\min(\Gamma))roman_max ( - roman_Γ , 0 ) + roman_max ( roman_Γ - 1 , 0 ) - ( roman_max ( roman_Γ ) - roman_min ( roman_Γ ) ) (10)
+βL2(ΓT)+ιL1(Γ)𝛽𝐿2Γ𝑇𝜄𝐿1Γ\displaystyle+\beta*L2(\Gamma-T)+\iota*L1(\Gamma)+ italic_β ∗ italic_L 2 ( roman_Γ - italic_T ) + italic_ι ∗ italic_L 1 ( roman_Γ )

This last penalty function represents our attempt to state that the γisubscript𝛾𝑖\gamma_{i}italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT should be positive and in the interval [0,1]01[0,1][ 0 , 1 ] and in a such a way that we maximize the distance of the elements between 0s and 1s.

λ=𝜆absent\displaystyle\lambda=italic_λ = max(Γ,0)+max(Γ1,0)min(Γ)max(Γ)Γ0Γ10ΓΓ\displaystyle\max(-\Gamma,0)+\max(\Gamma-1,0)-\frac{\min(\Gamma)}{\max(\Gamma)}roman_max ( - roman_Γ , 0 ) + roman_max ( roman_Γ - 1 , 0 ) - divide start_ARG roman_min ( roman_Γ ) end_ARG start_ARG roman_max ( roman_Γ ) end_ARG (11)
+βL2(ΓT)+ιL1(Γ)𝛽subscript𝐿2Γ𝑇𝜄subscript𝐿1Γ\displaystyle+\beta*L_{2}(\Gamma-T)+\iota*L_{1}(\Gamma)+ italic_β ∗ italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( roman_Γ - italic_T ) + italic_ι ∗ italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( roman_Γ )

IV-D Hessian and Fisher Information

If we have N𝑁Nitalic_N parameters/weights, the Hessian HRN×N𝐻superscript𝑅𝑁𝑁H\in R^{N\times N}italic_H ∈ italic_R start_POSTSUPERSCRIPT italic_N × italic_N end_POSTSUPERSCRIPT has quite the space complexity (consider even small models could have million parameters). When we are already close to the optimal solution or we are trying to move from the optimal solution without using information from the gradient, the Hessian provides the most information close by an already established solution point. There are also ways to compute the Hessian and the effects of the Hessian by using Fisher information [26, 27, 28]. This will reduce to the computation of the gradient of the loss function.

IV-E Diagonal Hessian

We applied a Fisher measure and computed 2superscript2\nabla^{2}{\bf\ell}∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_ℓ, that is computed just the diagonal of the Hessian. Again, we use the L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT over the normalized weight and went through the process of training. The elements of the diagonal are not representative in general, but they are a good approximation of the contribution of a single weight. The Fisher and 2superscript2\nabla^{2}{\bf\ell}∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_ℓ did not provide any main advantages. But this information is very useful in quantization and optimizations within the same field and application.

IV-F Predetermined Sparsity ratio and Full Training Ahead

Take a convolution with Γ=1Γ1\Gamma=1roman_Γ = 1 and weights W𝑊Witalic_W. Once again for each γisubscript𝛾𝑖\gamma_{i}italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, this will be representative of a block WiR8×h×w×8R8×8subscript𝑊𝑖superscript𝑅8𝑤8similar-tosuperscript𝑅88W_{i}\in R^{8\times h\times w\times 8}\sim R^{8\times 8}italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_R start_POSTSUPERSCRIPT 8 × italic_h × italic_w × 8 end_POSTSUPERSCRIPT ∼ italic_R start_POSTSUPERSCRIPT 8 × 8 end_POSTSUPERSCRIPT. We can choose the Wisubscript𝑊𝑖W_{i}italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT using a measure of importance:

  • L2(Wi)=kwk2subscript𝐿2subscript𝑊𝑖subscript𝑘superscriptsubscript𝑤𝑘2L_{2}(W_{i})=\sqrt{\sum_{k}w_{k}^{2}}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = square-root start_ARG ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG with wkWisubscript𝑤𝑘subscript𝑊𝑖w_{k}\in W_{i}italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT,

  • L1(Wi)=k|wk|subscript𝐿1subscript𝑊𝑖subscript𝑘subscript𝑤𝑘L_{1}(W_{i})=\sum_{k}|w_{k}|italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | as above,

  • Variance σ2=164k(wkμ)2superscript𝜎2164subscript𝑘superscriptsubscript𝑤𝑘𝜇2\sigma^{2}=\frac{1}{64}\sum_{k}(w_{k}-\mu)^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG 64 end_ARG ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - italic_μ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT with μ=164wk𝜇164subscript𝑤𝑘\mu=\frac{1}{64}\sum w_{k}italic_μ = divide start_ARG 1 end_ARG start_ARG 64 end_ARG ∑ italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, with wkWisubscript𝑤𝑘subscript𝑊𝑖w_{k}\in W_{i}italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT oder 1Nwk1𝑁subscript𝑤𝑘\frac{1}{N}\sum w_{k}divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, with wkWsubscript𝑤𝑘𝑊w_{k}\in Witalic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_W.

We can then sort them in ascending order. We set the first half to zero. Then we start re-training. We do this for the entire network or for one convolution at a time.

In Table I, we show the results by using one-time mask and full training: VGG-16, ResNet-50, Inceptionv3 on ImageNet20 (20 classes) and ImageNet1k (1000 classes). We use three samples per class for the validation accuracy for ImageNet1k data set; instead, we use 50 samples per class for ImageNet20. Fine-tuning sparse networks on the original ImageNet data-set [9] is expensive. To reduce the training time, we chose 20 classes (from the original 1000 classes) with the least number of images per class in the training data-set and this choice will affect the accuracy because there are fewer samples for re-training.

TABLE I: Accuracies of the sparsity models
Model Dataset Baseline Sparsity
Acc.(%) block ratio (%) Acc.(%)
Inception-v3 ImageNet1k 77.2 8x8 50 75.5
ResNet-50 ImageNet1k 76.7 8x8 50 74.6
VGG-16 ImageNet1k 70.6 8x8 50 69.7
ResNet-50 ImageNet20 96.1 8x8 25 95.1
ResNet-50 ImageNet20 96.1 8x8 50 92.0
ResNet-50 ImageNet20 96.1 8x8 75 87.1
ResNet-50 ImageNet20 96.1 1x1 25 96.0
ResNet-50 ImageNet20 96.1 1x1 50 95.6
ResNet-50 ImageNet20 96.1 1x1 75 93.5
VGG-16 ImageNet20 92.0 8x8 50 89.6
VGG-16 ImageNet20 92.0 1x1 50 92.3
VGG-16 ImageNet20 92.0 1x1 75 91.7

Classification accuracy on ImageNet1k drops by only 1-2% after applying 50% sparsity with a 8×8888\times 88 × 8 block (this is without any quantization). We experiment with different block shapes such as 16×416416\times 416 × 4 and 4×164164\times 164 × 16 on ResNet-50, but the accuracy is slightly worse. The different shape has the same volume and either more parallel vector operations or longer vector computations. This last process based on PyTorch has been the most accurate while the other approaches had a drop of 7 points in accuracy at least. It is not clear if it is a methodology weakness or just bad execution. It is important to us to present all the tools attempted.

Fine-grained sparsity (1×1111\times 11 × 1 block or unstructured) does not sacrifice any accuracy (i.e., almost any). This is not equivalent to 2 over 4 (or 4 over 8) sparsity now available in GPUs.

V Compiler and its Code generation

We take a PyTorch/Keras model and quantize it before creating an intermediate representation. Consider a weight W𝑊Witalic_W. The linear operation can be written as y=Wx𝑦𝑊𝑥y=Wxitalic_y = italic_W italic_x, and the quantized one is y=Wqx𝑦subscript𝑊𝑞𝑥y=W_{q}xitalic_y = italic_W start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT italic_x. The quantization function is defined as:

Wq=ΔRound(W/Δ) with Δ=max(|W|)2N1subscript𝑊𝑞Δ𝑅𝑜𝑢𝑛𝑑𝑊Δ with Δ𝑚𝑎𝑥𝑊superscript2𝑁1W_{q}=\Delta*Round(W/\Delta)\mbox{ with }\Delta=\frac{max(|W|)}{2^{N-1}}italic_W start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT = roman_Δ ∗ italic_R italic_o italic_u italic_n italic_d ( italic_W / roman_Δ ) with roman_Δ = divide start_ARG italic_m italic_a italic_x ( | italic_W | ) end_ARG start_ARG 2 start_POSTSUPERSCRIPT italic_N - 1 end_POSTSUPERSCRIPT end_ARG (12)

Our intermediate representation is a graph where each node is an operation that reads tensors and writes one tensor. A convolution has one quantized input INT8 with a position where the fraction starts (power of two scale). It computes a tensor using the same layout and with a proper scale. The weights and bias are properties of the convolutions. They are tailored and laid out at compile time, they are COUT×h×w×CINsubscript𝐶𝑂𝑈𝑇𝑤subscript𝐶𝐼𝑁C_{OUT}\times h\times w\times C_{IN}italic_C start_POSTSUBSCRIPT italic_O italic_U italic_T end_POSTSUBSCRIPT × italic_h × italic_w × italic_C start_POSTSUBSCRIPT italic_I italic_N end_POSTSUBSCRIPT

The compiler in this work can create the parameterized representation of block sparsity, and has the capability to split tensors and computations accordingly to a parameterized representation of the architecture. Our Hardware abstraction is a Python class, describing a variety of systems. All weights are statically prepared into DDR and we move them explicitly towards the inner levels. Inputs and outputs have designated space in DDR. DDR can and it will be used for tensors spills. The memory allocation to Memtile is basically coloring algorithms and some heuristics. In this architecture, we do not allow streaming of neither data nor weights (because they share space in Memtile and input and output have different consumption/production rates).

V-A Hardware Abstraction

Refer to caption
(a)
Figure 4: 4x4 AIE representation

As shown in Figure 4 as an example, we work with a mesh of 4x4 tensor cores connected by 4 horizontal and 4 vertical interconnections. We present estimates for square 2x2, .. i×i𝑖𝑖i\times iitalic_i × italic_i .. 8x8 and rectangular shapes are in the works (4×1414\times 14 × 1, 4×2424\times 24 × 2, and 8×2828\times 28 × 2 into a 8×8888\times 88 × 8 with 2 Memtiles per column). Each core has 8 banks memories for a total 64 KB. About six banks are used as input/output/weight buffers and two banks are used as temporary space for kernels. Each core can request and send data to its direct neighbors (if aware of connection and control but this utility is not used here). Double buffering using ping/pong is used for inputs and outputs.

There are four Memtiles: each 512 KB and each is connected to one columns and its direct neighbor column, or it is connected to a row and its neighbor. The total amount of space is 2 MB. Memtile is a circular buffer to exploit more flexible allocation. Note a 2×2222\times 22 × 2 architecture will have one Memtile per column and a total of two Memtiles (1 MB).

A Memtile can broadcast data per column or per row; it is a design choice. We can dedicate one Memtile for weights, one for activations, or we can share it. In this work, we present results for shared Memtiles. To maximize the computation parallelism, every core will write data per column into Memtile.

V-B Subvolumes, Data Compression, and Data Movements

The computation is split by Memtile and thus by column (cores columns). The output tensor is computed and split evenly by width. Thus one Memtile will store one partial tensor by width, each core will compute different output channels, and the computation streams the output tensor by rows and using ping/pong double buffering. We prioritize to reuse weights in core. The cores set is a cohort and we always choose symmetric computations.

If we have the inputs, output, and weights in Memtile, the minimum computation is one output channel and one row (i.e, by height). If this is not possible, we try to reduce the size of the width (e.g., shaping the tensor in Memtile by using DDR spills) and we can manage to split the input channels and to split the weights accordingly and prepare for accumulation. We call W-Split the distribution of tensor by columns in the Tensor mesh. We can COUTsubscript𝐶𝑂𝑈𝑇C_{OUT}italic_C start_POSTSUBSCRIPT italic_O italic_U italic_T end_POSTSUBSCRIPT-split, this requires the partial transfer of weights. We can CINsubscript𝐶𝐼𝑁C_{IN}italic_C start_POSTSUBSCRIPT italic_I italic_N end_POSTSUBSCRIPT-split when we need to split by input channel, this is the last resort because it is also the most expensive accumulation of the outputs. CINsubscript𝐶𝐼𝑁C_{IN}italic_C start_POSTSUBSCRIPT italic_I italic_N end_POSTSUBSCRIPT-split can be implemented as a graph optimization by splitting the convolution into two and then use an element wise operation to combine the results, which can be done recursively.

The subvolume describes the smallest shape of the weights that we need to manage and the largest computation in the core. We compress the weight accordingly. Any data movement will always be a multiple of the subvolume, a multiple of 8×8888\times 88 × 8, and it is a single load. Such a compressed data will have the same properties whether it is sparse or dense.

V-C Schedule and Memory Allocation

During the scheduling of each layer, we evaluate what tensors can fit in Memtile. Here, activation and weight tensors share the space. At each step, the memory allocation will check if we can allocate inputs, weights, and outputs. If we cannot, we evict all tensors into DDR and then split the time of the computation.

At the end of this stage, every tensor will have an address in Memtile or DDR (or both). If there are only DDR addresses, the compiler will take the basic layer computation and, by heuristics, will split the computation and the output tensor by width, output channel, height, and input channel (no necessarily in this order). The heuristics have a single objective to find the largest problem fitting the (each) memory level. We deploy a recursive approach of tiling. Formally, ˙˙\dot{\sum}over˙ start_ARG ∑ end_ARG is a parallel loop and a W-split can be written as follows:

Y=Conv(X,W)=˙wConv(Xw,W)𝑌𝐶𝑜𝑛𝑣𝑋𝑊subscript˙𝑤𝐶𝑜𝑛𝑣subscript𝑋𝑤𝑊Y=Conv(X,W)=\dot{\sum}_{w}Conv(X_{w},W)italic_Y = italic_C italic_o italic_n italic_v ( italic_X , italic_W ) = over˙ start_ARG ∑ end_ARG start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT italic_C italic_o italic_n italic_v ( italic_X start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_W ) (13)

The split is a function of the footprint. Before and after each convolution, there will be an explicit data movement as option. At this stage each input, output, and weights have addresses associated with each sub-computation. Then the code generation of each Conv(Xw,W)𝐶𝑜𝑛𝑣subscript𝑋𝑤𝑊Conv(X_{w},W)italic_C italic_o italic_n italic_v ( italic_X start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_W ) is independent and recursive as needed. This is a tree-like structure. If the convolution has strides or a large kernel, each sub-convolution has overlap data; however, the sub-convolution has defined addresses and data movements. For a W-split such as this, we are computing the output by rows and the weights are reused (read once).

V-D Code Generation

The compiler creates a list of operations. These operations become smaller and smaller and then can be executed from Memtile to Memtile. There is a further decomposition using only the Tensor cores and it is completely determined by the subvolume. Here, we show how we generate code at this level and estimate time as in Figure 5. This is the computation of a convolution with top/bottom padding by height:

Yheight=Conv(Xh=0)+˙˙h=19Conv(Xh)+˙Conv(Xh=10)subscript𝑌𝑒𝑖𝑔𝑡𝐶𝑜𝑛𝑣subscript𝑋0˙superscriptsubscript˙19𝐶𝑜𝑛𝑣subscript𝑋˙𝐶𝑜𝑛𝑣subscript𝑋10Y_{height}=Conv(X_{h=0})\dot{+}\dot{\sum}_{h=1}^{9}Conv(X_{h})\dot{+}Conv(X_{h% =10})italic_Y start_POSTSUBSCRIPT italic_h italic_e italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT = italic_C italic_o italic_n italic_v ( italic_X start_POSTSUBSCRIPT italic_h = 0 end_POSTSUBSCRIPT ) over˙ start_ARG + end_ARG over˙ start_ARG ∑ end_ARG start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 9 end_POSTSUPERSCRIPT italic_C italic_o italic_n italic_v ( italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) over˙ start_ARG + end_ARG italic_C italic_o italic_n italic_v ( italic_X start_POSTSUBSCRIPT italic_h = 10 end_POSTSUBSCRIPT ) (14)

An important feature of the current system is the concept of iteration between Memtile and core. Using locks and chaining the locks, we write a single instruction from the prospective of a single core (as a SIMD instruction) and driving all cores at once for multiple iterations ˙h=1iConv(Xw)superscriptsubscript˙1𝑖𝐶𝑜𝑛𝑣subscript𝑋𝑤\dot{\sum}_{h=1}^{i}Conv(X_{w})over˙ start_ARG ∑ end_ARG start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT italic_C italic_o italic_n italic_v ( italic_X start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) in Equation 14. The ASM-like code follows:

  LOADFM Lock k_0 Memtile addr core addr iter i
  CONV iteration      i
  WRITEFM Lock k_1 Memtile addr core addr iter i

There is an implicit lock (say k_x) that is used for the pong and the system cycles between locks (k_x and k_0). These three operations execute a number of iterations i and, using a ping/pong, they will load different slices of data and compute different slices of data.

Equation 14 is encoded as follows:

  ## Head top pad < 50 us First comp block
  LOADFM Lock k_0 Memtile addr_0 core addr iter 1
  CONV iteration 1
  WRITEFM Lock k_1 Memtile addr_1 core addr iter 1
  ## Body iteration > 50 us < 150 us
  ## k_0 -> k_2 -> k_4 Lock Chain
  LOADFM Lock k_2 Memtile addr_2 core addr iter 9
  CONV iteration 7
  WRITEFM Lock k_3 Memtile addr_3 core addr iter 9
  ## tail bottom pad > 150 us Last computation block
  LOADFM Lock k_4 Memtile addr_4 core addr iter 1
  CONV iteration 1
  WRITEFM Lock k_5 Memtile addr_5 core addr iter 1

We present in Figure 5 the execution estimate of this code. At this stage, we have all the information. Per layer, the code generation is a two pass process. First, we generate code for the all loads/stores. Second we combine them into chains with dependency, logically correct and as fast as possible.

Refer to caption
(a)
Figure 5: Resnet single convolution with padding for 4x4: LOAD activation from DDR to Memtile, LOADW weights from DDR to Memtile, LOADFM activation from Memtile to Tensor cores, LOADWM weights from Memtile to Tensor cores, WRITE from Memtile to DDR, WRITEFM from Tensor Cores to Memtile, COMP Computation in this case a convolution.

We could estimate the time execution without a full code generation. When we annotate time information to a load, we have assurance that the load is a complete description of the DMA communication between multiple memories and as complex as the architecture. Actually, this is literally translated to a binary executable that perform the data movement.

V-E Time Estimation

We explain how we capture the execution time and visualize it as in Figure 5. We start by the time estimates for DDR to Memtile communications. We have two communication types: activations and weights. Per Memtile there are two dedicated channels.

  • If we share activations and weights in the same Memtile, we can use one channel for activations and one for weights. Thus the loads from DDR to Memtile (LOAD and LOADW) are parallel with a bandwidth of 4 GBps. Writes from Memtile to DDR (WRITE) can use both channels (8 GBps).

  • If activations and weights go to different Memtiles (for example weights to Memtiles ’0’ and ’3’ and activations to ’1’ and ’2’), each load is parallel and 8 GBps. Writes are identical.

The Memtile connections with AIE cores are different. We assume a few channels with 4 GBps bandwidth. One Memtile can broadcast inputs to a cores column. These communications are for activations (LOADFM). One Memtile can broadcast to rows of cores, these are for weights (LOADWM). We assume that the column and row communications are parallel.

Every communication with iteration one is synchronous and sequential. The load, convolution, and store is executed one after the other and every core is independent. For synchronization and for bookkeeping, we assume that Weights communications from Memtiles to cores are synchronous and halting (LOADWM).

Every communication with iteration larger than one, we assume that load (LOADFM), computation (COMP), and store (WRITEFM) are executed in parallel and the overall execution time is the maximum of the estimated time multiplied by the number of iterations.

We estimate the execution time of a subvolume (COMP) by the number of operations divided by the maximum number of operations per cycle which is in our scenario is 4×8×8=2564882564\times 8\times 8=2564 × 8 × 8 = 256 operations per cycle and 1 GHz frequency. Sparsity reduces computation and communication time.

We do not account the wait and synchronization which are necessary to reprogram the fabric. These are very expensive running on for a few milliseconds.

V-F Sparse Convolution example

Refer to caption
(a)
Figure 6: Resnet single convolution with padding and sparsity for 4x4 Tensor Cores

We present the time estimate for a convolution with padding, dense, and with 50% sparsity, see Figure 5 and 6.

For these convolutions, the computation dominates the execution time. Sparsity cuts the execution time by half: from 200 μs𝜇𝑠\mu sitalic_μ italic_s to 130 μs𝜇𝑠\mu sitalic_μ italic_s. On one hand, there are convolutions that realize up to 2×2\times2 × performance; on the other, there are convolutions that are dominated by the reading or writing. In the latter case, sparsity helps in space saving and probably DDR tensors spilling. In principle, we could relax sparsity requirements for those convolutions that are communication bound (and restart training). In Figure 7, we provide the time estimates for Resnet50 using sparsity 50% and using 4×4444\times 44 × 4 AIE array.

Refer to caption
(a)
Figure 7: Resnet50 for 4x4 Tensor Cores with 50% sparse weights

VI Depth-wise Tiling

Take the subgraph schedule L0,L1,Ljsubscript𝐿0subscript𝐿1subscript𝐿𝑗L_{0},L_{1},\dots L_{j}italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … italic_L start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, for every i𝑖iitalic_i Lisubscript𝐿𝑖L_{i}italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a layer that produces a single tensor Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Depth-wise tiling is the idea of tiling the output tensor Tjsubscript𝑇𝑗T_{j}italic_T start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT into tiles umsubscript𝑢𝑚u_{m}italic_u start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, then cut the computation accordingly to find the minimum corresponding input shape and size of T1subscript𝑇1T_{-1}italic_T start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT and T0subscript𝑇0T_{0}italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, even overlapping, to carry the computation to umsubscript𝑢𝑚u_{m}italic_u start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. Let us define uj=1×1subscript𝑢𝑗11u_{j}=1\times 1italic_u start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = 1 × 1, where we neglect the channels, as a sub-tensor of layer Ljsubscript𝐿𝑗L_{j}italic_L start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT; that is, a sub-vector of Tjsubscript𝑇𝑗T_{j}italic_T start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. A projection \P is a function taking u𝑢uitalic_u and projects the input sub-tensor needed to compute u𝑢uitalic_u. In reverse order, starting from Ljsubscript𝐿𝑗L_{j}italic_L start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT compute (Lj,uj)subscript𝐿𝑗subscript𝑢𝑗\P(L_{j},u_{j})¶ ( italic_L start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) and propagate the projection. When a layer Lmsubscript𝐿𝑚L_{m}italic_L start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT and tensor Tmsubscript𝑇𝑚T_{m}italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT feeds two (or more) layers Lxsubscript𝐿𝑥L_{x}italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and Lysubscript𝐿𝑦L_{y}italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT. Then um=max((Lx,ux),(Ly,uy))subscript𝑢𝑚subscript𝐿𝑥subscript𝑢𝑥subscript𝐿𝑦subscript𝑢𝑦u_{m}=\max(\P(L_{x},u_{x}),\P(L_{y},u_{y}))italic_u start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = roman_max ( ¶ ( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) , ¶ ( italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) ) when uxsubscript𝑢𝑥u_{x}italic_u start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and uysubscript𝑢𝑦u_{y}italic_u start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT are defined completely and max mean the largest. We carry the propagation of (Lm,um)subscript𝐿𝑚subscript𝑢𝑚\P(L_{m},u_{m})¶ ( italic_L start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) and we have T1=(L0,u0)=(L0(L1,u1))=(L0,(Ll,ul)T_{-1}=\P(L_{0},u_{0})=\P(L_{0}\P(L_{1},u_{1}))=\P(L_{0},\dots\P(L_{l},u_{l})italic_T start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT = ¶ ( italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = ¶ ( italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ¶ ( italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) = ¶ ( italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … ¶ ( italic_L start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ). At the end of the propagation, imagine this is a generalized convolution with kernel size k=(L0,u0)𝑘subscript𝐿0subscript𝑢0k=\P(L_{0},u_{0})italic_k = ¶ ( italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ). Now if we repeat the process with uj˙=2×2˙subscript𝑢𝑗22\dot{u_{j}}=2\times 2over˙ start_ARG italic_u start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG = 2 × 2k+s=(L0,u0˙)𝑘𝑠subscript𝐿0˙subscript𝑢0k+s=\P(L_{0},\dot{u_{0}})italic_k + italic_s = ¶ ( italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , over˙ start_ARG italic_u start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ). Thus we have an estimate of the generalized stride.

Now we can split the computation by selecting a non overlapping decomposition of the output tensor: say using M tiles each of size uosubscript𝑢𝑜u_{o}italic_u start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT. The input is split into M tiles of size (uo1)s+ksubscript𝑢𝑜1𝑠𝑘(u_{o}-1)s+k( italic_u start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT - 1 ) italic_s + italic_k and we will have an overlap of size k1𝑘1k-1italic_k - 1. We can choose a tiling in such a way there is zero DDR communication in between L0subscript𝐿0L_{0}italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and Ljsubscript𝐿𝑗L_{j}italic_L start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. With input overlap, thus with extra reads, we have extra computations. However, we can reduce drastically the DDR communications. We use the term generalized convolution but it is not a convolution and it is only applied to a graph computation (with element wise operations and transpose convolutions, we do not think it is applicable to height and width softmax or layer norms).

VI-A Real-time analysis and Zero DDR Communication

Assume, we have an output tile size uosubscript𝑢𝑜u_{o}italic_u start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT. We repeat the projection process using the schedule. This time, we keep information of the active tensors at any time in the schedule. Projections and active tensors are sufficient to represent the maximum active space requirement (not including input and output). If we want zero DDR spill for inputs and outputs, we choose a output tile for which the maximum active space is smaller than Memtile. Say we have M tiles. With an architecture for which computation is free (see [14, 3]), we write Tjsubscript𝑇𝑗T_{j}italic_T start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT once, we read T0subscript𝑇0T_{0}italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, and we reread (M1)(k1)𝑀1𝑘1(M-1)(k-1)( italic_M - 1 ) ( italic_k - 1 ). This is better than l=0jTlsuperscriptsubscript𝑙0𝑗subscript𝑇𝑙\sum_{l=0}^{j}T_{l}∑ start_POSTSUBSCRIPT italic_l = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT italic_T start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT especially for large j𝑗jitalic_js.

A compiler can target easily a tiling with zero DDR communications (for layer output tensors). In practice, zero DDR communications is not our final goal.

VI-B Iteration and Memory allocation

Take the schedule L0subscript𝐿0L_{0}italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPTLjsubscript𝐿𝑗L_{j}italic_L start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, let us apply the live tensor analysis and tensor projections for a specific tile size uosubscript𝑢𝑜u_{o}italic_u start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT and M𝑀Mitalic_M tiles. We add two custom layers as follows

Ib(M),L0,L1,Ln,Ie(M)subscript𝐼𝑏𝑀subscript𝐿0subscript𝐿1subscript𝐿𝑛subscript𝐼𝑒𝑀I_{b}(M),L_{0},L_{1},...L_{n},I_{e}(M)italic_I start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_M ) , italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … italic_L start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ( italic_M )

The layer Ib(M)subscript𝐼𝑏𝑀I_{b}(M)italic_I start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_M ) is an iteration layer or input boundary that takes the input tensor T1subscript𝑇1T_{-1}italic_T start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT and carves out a sub-tensor shape and size (uo1)s+ksubscript𝑢𝑜1𝑠𝑘(u_{o}-1)s+k( italic_u start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT - 1 ) italic_s + italic_k. The layer Ie(M)subscript𝐼𝑒𝑀I_{e}(M)italic_I start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ( italic_M ) is an iteration layer or output boundary that takes the input tensor uosubscript𝑢𝑜u_{o}italic_u start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT and copies it to a tensor of shape and size Tjsubscript𝑇𝑗T_{j}italic_T start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. For the purpose of the memory allocation, we have a legal schedule that describes basically the property of the computation. We can do memory allocation.

VI-C Code Generation

A computation is basically defined by its layer and by the addresses in memory of its input and output tensors. The code generation unrolls the iterations as it was a loop and Ibsubscript𝐼𝑏I_{b}italic_I start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT and Iesubscript𝐼𝑒I_{e}italic_I start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT generate the proper sub tensor addresses and specifying the number of iterations.

The code generation will be for the schedule:

Ib(0),L0,L1,Ln,Ie(0),Ib(M1),L0,L1,Ln,Ie(M1)subscript𝐼𝑏0subscript𝐿0subscript𝐿1subscript𝐿𝑛subscript𝐼𝑒0subscript𝐼𝑏𝑀1subscript𝐿0subscript𝐿1subscript𝐿𝑛subscript𝐼𝑒𝑀1I_{b}(0),L_{0},L_{1},...L_{n},I_{e}(0),\dots I_{b}(M-1),L_{0},L_{1},...L_{n},I% _{e}(M-1)italic_I start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( 0 ) , italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … italic_L start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ( 0 ) , … italic_I start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_M - 1 ) , italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … italic_L start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ( italic_M - 1 )

With a little of book keeping we can write the time estimates as a time series and compare performance.

VI-D Complementary: depth-wise tiling and sparsity

The depth-wise tiling has the tendency to increase the computation number and decrease the DDR loads and stores. As fomulated, this tiling may read multiple times the weights. Sparsity reduce the number of computations and the weights communications. Together have the opportunity to work better.

TABLE II: Execution Time estimates
AIE2 Model Dense sec Sparse sec
2x2 Resnet 2.492347e-02 1.582626e-02
3x3 1.269543e-02 8.661490e-03
4x4 1.077318e-02 7.064918e-03
5x5 failed 4.303485e-03
6x6 5.712521e-03 4.490127e-03
7x7 4.205991e-03 3.212234e-03
8x8 6.376768e-03 4.602027e-03
2x2 IncV3 4.283837e-02 2.440544e-02
3x3 2.386600e-02 1.422390e-02
4x4 1.740967e-02 1.012540e-02
5x5 9.690552e-03 failed
6x6 1.063962e-02 6.439692e-03
7x7 8.727651e-03 failed
8x8 9.093276e-03 5.666152e-03
2x2 VGG16 4.476212e-02 2.608593e-02
3x3 2.53343e-02 1.002015e-02
4x4 1.371000e-02 8.852128e-03
5x5 failed 4.336479e-03
6x6 failed 5.770197e-03
7x7 7.455440e-03 5.288551e-03
8x8 9.203393e-03 6.502333e-03

VII Performance of Sparsity

In Table II, we present the performance of neural networks with sparsity, where the sparsity is applied to all the convolutions (except the first one because there are only three channels and sparsity requires at least eight) for Resnet 50, Inception V3, and VGG16. We estimate the total execution time for three networks and seven configurations in Table II. We report also the case where the compiler fails to generate code. Notice that the configuration 8×8888\times 88 × 8 AIEs is never beneficial. A full investigation is necessary.

Corner cases are represented as failure in Table II. Some cases is because of inability to break the weight tensor evenly. Sometime is for incorrect data management especially for prime numbers. These are all issues that we will address as the technology mature. Please, note that VGG16 using 8x8 configuration is slower than 7x7 (by using sparse). For a symmetric computation too small sub-volume computations make the computation overall more inefficient and requiring more iterations for the same amount of data transfers. This is a case where more HW does not improve performance, which is interesting and relevant.

VII-A Depth-Wise Tiling for VGG16 3x3

We present results for VGG16 because of simplicity and each layer tensor do not fit the three Memtile (of a 3×3333\times 33 × 3 system). We can apply the same approach to Resnet and inception. The generalized convolution idea is applicable.

We take VGG and we instruct the DDR to be 16 times slower (4GBs/16) highlighting the need of fewer DDR communications. We take only the part that requires DDR spills for each layer. In Figure 8, we show the performance for VGG16 using DDR spills for each layer: 0.025s total time.

Refer to caption
(a)
Figure 8: VGG16 3x3 DDR only
Refer to caption
(a)
Figure 9: VGG16 3x3 DDR with 2 tiles
Refer to caption
(a)
Figure 10: VGG16 3x3 DDR with 2 tiles and sparse

We apply depth-wise tiling using two tiles and we achieve better performance at 0.022s. see Figure 9. Sparsity by itself without any tiling can achieve only 0.021s. Sparsity and tiling improves even further and we achieve 0.014s, see Figure 10. We can appreciate the reduction of activation DDR communications thanks to depth-wise tiling and the reduction of computation and weights communication by sparsity.

VIII Conclusions

This is a multifaceted problem and we present a complete solution from training techniques, compilers, code generation, Hardware definition, and time estimations. It is a vertical software system, more complex than just a prototype, and it is used for the validation and comparison of different Hardware designs. A few convolutions have been validated in simulation and in hardware.

This could be seen as a quantization and sparsification problem. For example, how we can reduce the footprint of a CNN network. There are post training techniques that are targeting quantization and unstructured sparsity [10]. We need to be more aggressive and training for it. Our sparsity is not really a property of the model, software can describe it, and the hardware can take advantage; however, we do not need specific hardware support at instruction level. To our knowledge we are the first applying sparsity to Tensor Cores such as AIE2 overlays systematically.

We demonstrated Sparsity can accelerate the computing in Matrix multiplication and Convolution in neural networks. Matrix multiplication is appealing for the application in LLM and application in GPUs. Convolutions is far richer in complexity and it is the work-horse for FPGAs based products and systolic array systems/computations.

References

  • [1] S. Ahmad and L. Scheinkman, “How can we be so dense? the benefits of using highly sparse representations,” 2019.
  • [2] AMD, “rocsparse,” 2020. [Online]. Available: https://rocsparse.readthedocs.io/en/master/
  • [3] G. Bilardi, A. Pietracaprina, and P. D’Alberto, “On the space and access complexity of computation dags,” in Graph-Theoretic Concepts in Computer Science, 26th International Workshop, WG 2000, Konstanz, Germany, June 15-17, 2000, Proceedings.   Springer, 2000.
  • [4] J. Cai, Y. Wei, Z. Wu, S. Peng, and K. Ma, “Inter-layer scheduling space definition and exploration for tiled accelerators,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023.
  • [5] F. Chollet, “Keras,” 2015. [Online]. Available: https://keras.io
  • [6] P. D’Alberto, “Block sparsity and training,” https://github.com/paolodalberto/BlockSparsityyTraning, 2020.
  • [7] ——, “Sparsefastmm,” 2020. [Online]. Available: https://github.com/paolodalberto/SparseFastMM/
  • [8] T. A. Davis, “Direct methods for sparse linear systems volume 2 of fundamentals of algorithms,” in SIAM, 2006.
  • [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   IEEE, 2009, pp. 248–255.
  • [10] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” 2023.
  • [11] S. Gray, A. Radford, and D. P. Kingma, “Gpu kernels for block-sparse weights,” 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:52220661
  • [12] ——, “Gpu kernels for block-sparse weights,” in openai.com, 2017.
  • [13] T. Hoefler, D. Alistarh, T. B. Nun, N. Dryden, and A. Peste, “Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks,” in J. Mach. Learn. Res., 2021.
  • [14] J.-W. Hong and H. T. Kung, “I/o complexity: The red-blue pebble game,” in Symposium on the Theory of Computing, 1981. [Online]. Available: https://api.semanticscholar.org/CorpusID:8410593
  • [15] Q. Huang, M. Kang, G. Dinh, T. Norell, A. Kalaiah, J. Demmel, J. Wawrzynek, and Y. S. Shao, “Cosa: Scheduling by constrained optimization for spatial accelerators,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021.
  • [16] K. L. Hunter, L. Spracklen, and S. Ahmad, “Two sparsities are better than one: Unlocking the performance benefits of sparse-sparse networks,” in CoRR abs/2112.13896, 2022.
  • [17] T. Jeong, E. Ghasemi, J. Tuyls, E. Delaye, and A. Sirasao, “Neural network pruning and hardware acceleration,” in 2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing (UCC), 2020.
  • [18] M. Kurtz, J. Kopinsky, R. Gelashvili, A. Matveev, J. Carr, M. Goin, W. Leiserson, S. Moore, N. Shavit, and D. Alistarh, Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks.   PMLR, 2020.
  • [19] Z. Li, D. Orr, V. Ohan, G. D. costa, T. Murray, A. Sanders, D. Beker, and D. Masters, “Popsparse: Accelerated block sparse matrix multiplication on ipu,” 2023.
  • [20] NVIDIA, “cusparse,” 2020. [Online]. Available: https://developer.nvidia.com/cusparse
  • [21] J. Pool, “Accelerating sparsity in the nvidia ampere architecture,” in gtc/s22085, 2020.
  • [22] E. Russo, M. Palesi, G. Ascia, D. Patti, S. Monteleone, and V. Catania, “Memory-aware dnn algorithm-hardware mapping via integer linear programming,” in Proceedings of the 20th ACM International Conference on Computing Frontiers, 2023.
  • [23] T. Saad, “Iterative methods for sparse linear systems,” in SIAM, 2003.
  • [24] Y. Saad, “Iterative methods for linear systems of equations: A brief historical journey,” in In Susanne C. Brenner, Igor E. Shparlinski, Chi-Wang Shu, and Daniel B. Szyld, editors, 75 Years of Mathematics of Computation, volume 754 of Contemporary Mathematics. American Mathematical Society, 2020.
  • [25] A. E. Siswanto, “Block sparsity and weight initialization in neural network pruning,” in PhD thesis, Massachusetts Institute of Technology, 2021.
  • [26] Z. Yao, A. Gholami, S. Shen, K. Keutzer, and M. W. Mahoney, “Adahessian: An adaptive second order optimizer for machine learning,” in AAAI, 2021.
  • [27] S. Yu, Z. Yao, A. Gholami, Z. Dong, M. W. Mahoney, and K. Keutzer, “Hessian-aware pruning and optimal neural implant,” 2021.
  • [28] B. Zandonati, A. A. Pol, M. Pierini, O. Sirkin, and T. Kopetz, “Fit: A metric for model sensitivity,” 2022.