Search | arXiv e-print repository

Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators

Authors: Paolo D'Alberto, Taehee Jeong, Akshai Jain, Shreyas Manjunath, Mrinal Sarmah, Samuel Hsu, Yaswanth Raparti, Nitesh Pipralia

Abstract: Nowadays, increasingly larger Deep Neural Networks (DNNs) are being developed, trained, and utilized. These networks require significant computational resources, putting a strain on both advanced and limited devices. Our solution is to implement {\em weight block sparsity}, which is a structured sparsity that is friendly to hardware. By zeroing certain sections of the convolution and fully connected layers parameters of pre-trained DNN models, we can efficiently speed up the DNN's inference process. This results in a smaller memory footprint, faster communication, and fewer operations. Our work presents a vertical system that allows for the training of convolution and matrix multiplication weights to exploit 8x8 block sparsity on a single GPU within a reasonable amount of time. Compilers recognize this sparsity and use it for both data compaction and computation splitting into threads. Blocks like these take full advantage of both spatial and temporal locality, paving the way for fast vector operations and memory reuse. By using this system on a Resnet50 model, we were able to reduce the weight by half with minimal accuracy loss, resulting in a two-times faster inference speed. We will present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with Resnet50, Inception V3, and VGG16 to demonstrate the necessary synergy between hardware overlay designs and software stacks for compiling and executing machine learning applications.

Submitted 12 July, 2024; originally announced July 2024.

Comments: 12 pages, 10 figures, 1 table

ACM Class: C.5; D.3.4

arXiv:2312.12732

Strassen's Matrix Multiplication Algorithm Is Still Faster

Authors: Paolo D'Alberto

Abstract: Recently, reinforcement algorithms discovered new algorithms that really jump-started a wave of excitement and a flourishing of publications. However, there is little on implementations and applications and, especially, nothing on absolute performance; we show here that they are not yet ready to replace Strassen's original fast matrix multiplication. We present Matrix Flow, a simple Python project for the automatic formulation, design, implementation, code generation, and execution of fast matrix multiplication algorithms for CPUs, using BLAS interface GPUs, and in the future other accelerators. We shall not play with modulo-2 (Z2) algorithms and, for simplicity, we present only square double-precision matrices. By means of factorizing the operand matrices we can express many algorithms and prove them correct. These algorithms are represented by Data Flows and matrix data partitions: a Directed Acyclic Graph. We show that Strassen's original algorithm is still the top choice even for modern GPUs. We also address error analysis in double precision, because integer computations are correct, always.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 8 pages, 2 images, mathematical software

MSC Class: 97N80 ACM Class: G.4

arXiv:2308.00106

Entropy Maximization in Sparse Matrix by Vector Multiplication ($\max_E SpMV$)

Authors: Paolo D'Alberto, Abhishek Jain, Ismail Bustany, Henri Fraisse, Mansimran Benipal

Abstract: The peak performance of any SpMV depends primarily on the available memory bandwidth and its effective use. GPUs, ASICs, and new FPGAs have higher and higher bandwidth; however, for large scale and highly sparse matrices, SpMV is still a hard problem because of its random access pattern and workload imbalance. Here, we show how to turn randomness to our advantage. We propose a matrix permutation pre-processing step that aims to maximize the entropy of the distribution of the nonzero elements. We seek any permutation that uniformly distributes the non-zero elements, thereby generating a SpMV problem that is amenable to workload balancing or to speeding up sort algorithms. We conjecture these permutations would be most effective for matrices with no dense rows or columns and, as in preconditioning, when the matrix is reused. We shall show that entropy maximization is an optimization that any architecture may take advantage of, although in different ways. Most importantly, any developer can consider and deploy it. We shall present cases where we can improve performance by 15\% on AMD-based (GPU-CPU) systems.

Submitted 24 July, 2023; originally announced August 2023.

Comments: 26 pages

arXiv:2307.12875

Digital Advertising: the Measure of Mobile Visits Lifts

Authors: Paolo D'Alberto, Veronica Milenkiy, Fairiz Fi Azizi

Abstract: Mobile-phone advertising enables marketers to reach customers at a personal level and it enables the measure of customers' reactions by novel approaches, in real time, and at scale. By keeping a device anonymous, we can deliver custom adverts and we can check when the device owner will visit a specific brick-and-mortar location. This is the first step in a sale. By measuring visits and sales, the original marketers can determine their return on advertising and they can prove the efficacy of the marketing investments. We turn our attention to the measure of lift: we define it as the visit acceleration during the campaign flight with respect to a controlled baseline. We present a theoretical description; we describe a general and a simplified approach in composing the exposed and the control baseline; we develop two different vertical approaches with different comparable solutions; finally, we present how to carry out the experiments and the measures for a few dozen campaigns; these campaigns range from hundreds of thousands of devices and counting a few hundred visits to a handful of locations, to sixty million devices and counting million visits to thousands of locations. We care about experiments at scale.

Submitted 24 July, 2023; originally announced July 2023.

Comments: 27 pages, 18 figures

ACM Class: G.3; A.3; B.3; C.3

arXiv:2110.04327

DPUV3INT8: A Compiler View to programmable FPGA Inference Engines

Authors: Paolo D'Alberto, Jiangsha Ma, Jintao Li, Yiming Hu, Manasa Bollavaram, Shaoxia Fang

Abstract: We have an FPGA design, we make it fast, efficient, and tested for a few important examples. Now we must infer a general solution to deploy in the data center. Here, we describe the FPGA DPUV3INT8 design and our compiler effort. The hand-tuned SW-HW solution for Resnet50\_v1 has (close to) 2 times better images per second (throughput) than our best FPGA implementation; the compiler generalizes the hand-written techniques, achieving about 1.5 times better performance for the same example; the compiler generalizes the optimizations to a model zoo of networks, and it achieves 80+\% HW efficiency.

Submitted 8 October, 2021; originally announced October 2021.

Comments: 11 pages

arXiv:1805.07941

Quantizing Convolutional Neural Networks for Low-Power High-Throughput Inference Engines

Authors: Sean O. Settle, Manasa Bollavaram, Paolo D'Alberto, Elliott Delaye, Oscar Fernandez, Nicholas Fraser, Aaron Ng, Ashish Sirasao, Michael Wu

Abstract: Deep learning as a means to inferencing has proliferated thanks to its versatility and ability to approach or exceed human-level accuracy. These computational models have seemingly insatiable appetites for computational resources not only while training, but also when deployed at scales ranging from data centers all the way down to embedded devices. As such, increasing consideration is being made to maximize the computational efficiency given limited hardware and energy resources and, as a result, inferencing with reduced precision has emerged as a viable alternative to the IEEE 754 Standard for Floating-Point Arithmetic. We propose a quantization scheme that allows inferencing to be carried out using arithmetic that is fundamentally more efficient when compared to even half-precision floating-point. Our quantization procedure is significant in that we determine our quantization scheme parameters by calibrating against its reference floating-point model using a single inference batch rather than (re)training and achieve end-to-end post quantization accuracies comparable to the reference model.

Submitted 21 May, 2018; originally announced May 2018.

arXiv:1501.02185

Multiple-Campaign Ad-Targeting Deployment: Parallel Response Modeling, Calibration and Scoring Without Personal User Information

Authors: Paolo D'Alberto

Abstract: We present a vertical introduction to campaign optimization; that is, the ability to predict the user response to an ad campaign without any users' profiles on average and for each exposed ad. In practice, we present an approach to build a polytomous model, multi response, composed of several hundred binary models using generalized linear models. The theory was introduced twenty years ago and it has been applied in different fields since then. Here, we show how we optimize hundreds of campaigns and how this large number of campaigns may overcome a few characteristic caveats of single campaign optimization. We discuss the problem and solution of training and calibration at scale. We present statistical performance as {\em coverage}, {\em precision} and {\em recall} used in classification. We also present a discussion about the potential performance as throughput: how many decisions can be made per second streaming the bid auctions also by using dedicated hardware.

Submitted 10 May, 2015; v1 submitted 2 January, 2015; originally announced January 2015.

arXiv:1501.00491

Mapping and Matching Algorithms: Data Mining by Adaptive Graphs

Authors: Paolo D'Alberto, Veronica Milenkly

Abstract: Assume we have two bijective functions $U(x)$ and $M(x)$ with $M(x)\neq U(x)$ for all $x$ and $U,M: \mathbb{N} \rightarrow \mathbb{N}$. Every day and in different locations, we see the different results of $U$ and $M$ without seeing $x$. We are not assured about the time stamp nor the order within the day but at least the location is fully defined. We want to find the matching between $U(x)$ and $M(x)$ (i.e., we will not know $x$). We formulate this problem as an adaptive graph mining: we develop the theory, the solution, and the implementation. This work stems from a practical problem thus our definitions. The solution is simple, clear, and the implementation parallel and efficient. In our experience, the problem and the solution are novel and we want to share our finding.

Submitted 2 January, 2015; originally announced January 2015.

arXiv:1205.2927

A Heterogeneous Accelerated Matrix Multiplication: OpenCL + APU + GPU+ Fast Matrix Multiply

Authors: Paolo D'Alberto

Abstract: As users and developers, we are witnessing the opening of a new computing scenario: the introduction of hybrid processors into a single die, such as an accelerated processing unit (APU) processor, and the plug-and-play of additional graphics processing units (GPUs) onto a single motherboard. These APU processors provide multiple symmetric cores with their memory hierarchies and an integrated GPU. Moreover, these processors are designed to work with external GPUs that can push the peak performance towards the TeraFLOPS boundary. We present a case study for the development of dense Matrix Multiplication (MM) codes for matrix sizes up to $19K\times19K$, thus using all of the above computational engines, and an achievable peak performance of 200 GFLOPS for, literally, a made-at-home build. We present the results of our experience, the quirks, the pitfalls, the achieved performance, and the achievable peak performance.

Submitted 13 May, 2012; originally announced May 2012.

Comments: 15 pages, 6 figures, AMD Fusion Developer Summit 2012

ACM Class: G.4

arXiv:1107.2691

On the Weakenesses of Correlation Measures used for Search Engines' Results (Unsupervised Comparison of Search Engine Rankings)

Authors: Paolo D'Alberto, Ali Dasdan

Abstract: The correlation of the result lists provided by search engines is fundamental and it has deep and multidisciplinary ramifications. Here, we present automatic and unsupervised methods to assess whether or not search engines provide results that are comparable or correlated. We have two main contributions: First, we provide evidence that for more than 80% of the input queries - independently of their frequency - the two major search engines share only three or fewer URLs in their search results, leading to an increasing divergence. In this scenario (divergence), we show that even the most robust measures based on comparing lists are useless to apply; that is, the small contribution by too few common items will infer no confidence. Second, to overcome this problem, we propose the first content-based measures - i.e., direct comparison of the contents from search results; these measures are based on the Jaccard ratio and distribution similarity measures (CDF measures). We show that they are orthogonal to each other (i.e., Jaccard and distribution) and extend the discriminative power w.r.t. list based measures. Our approach stems from the real need of comparing search-engine results; it is automatic from the query selection to the final evaluation and it applies to any geographical market, thus designed to scale and to be used as a first filtering of query selection (necessary) for supervised methods.

Submitted 13 July, 2011; originally announced July 2011.

Comments: 16 pages, 19 figures

Showing 1–10 of 10 results for author: D'Alberto, P