Search | arXiv e-print repository

Exploring FPGA designs for MX and beyond

Authors: Ebby Samson, Naveen Mellempudi, Wayne Luk, George A. Constantinides

Abstract: A number of companies recently worked together to release the new Open Compute Project MX standard for low-precision computation, aimed at efficient neural network implementation. In this paper, we describe and evaluate the first open-source FPGA implementation of the arithmetic defined in the standard. Our designs fully support all the standard's concrete formats for conversion into and out of MX… ▽ More A number of companies recently worked together to release the new Open Compute Project MX standard for low-precision computation, aimed at efficient neural network implementation. In this paper, we describe and evaluate the first open-source FPGA implementation of the arithmetic defined in the standard. Our designs fully support all the standard's concrete formats for conversion into and out of MX formats and for the standard-defined arithmetic operations, as well as arbitrary fixed-point and floating-point formats. Certain elements of the standard are left as implementation-defined, and we present the first concrete FPGA-inspired choices for these elements, which we outline in the paper. Our library of optimized hardware components is available open source, and can be used to build larger systems. For this purpose, we also describe and release an open-source Pytorch library for quantization into the new standard, integrated with the Brevitas library so that the community can develop novel neural network designs quantized with MX formats in mind. We demonstrate the usability and efficacy of our libraries via the implementation of example neural networks such as ResNet-18 on the ImageNet ILSVRC12 dataset. Our testing shows that MX is very effective for formats such as INT5 or FP6 which are not natively supported on GPUs. This gives FPGAs an advantage as they have the flexibility to implement a custom datapath and take advantage of the smaller area footprints offered by these formats. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: 8 pages, 4 figures

arXiv:2406.14963 [pdf, other]

Optimised Grouped-Query Attention Mechanism for Transformers

Authors: Yuang Chen, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao

Abstract: Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our Asy… ▽ More Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted at ICML2024 ES-FoMo-II Workshop

arXiv:2406.14956 [pdf, other]

Unlocking the Global Synergies in Low-Rank Adapters

Authors: Zixi Zhang, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao

Abstract: Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the ef… ▽ More Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the efficacy of HeteroLoRA by performing the allocation in a more challenging search space that includes LoRA modules and LoRA-adapted shortcut connections. Experiments show that HeteroLoRA enables improvements in model performance given the same parameter budge. For example, on MRPC, we see an improvement of 1.6% in accuracy with similar training parameter budget. We will open-source our algorithm once the paper is accepted. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted at ICML2024 ES-FoMo-II Workshop

arXiv:2406.12421 [pdf, other]

ROVER: RTL Optimization via Verified E-Graph Rewriting

Authors: Samuel Coward, Theo Drane, George A. Constantinides

Abstract: Manual RTL design and optimization remains prevalent across the semiconductor industry because commercial logic and high-level synthesis tools are unable to match human designs. Our experience in industrial datapath design demonstrates that manual optimization can typically be decomposed into a sequence of local equivalence preserving transformations. By formulating datapath optimization as a grap… ▽ More Manual RTL design and optimization remains prevalent across the semiconductor industry because commercial logic and high-level synthesis tools are unable to match human designs. Our experience in industrial datapath design demonstrates that manual optimization can typically be decomposed into a sequence of local equivalence preserving transformations. By formulating datapath optimization as a graph rewriting problem we automate design space exploration in a tool we call ROVER. We develop a set of mixed precision RTL rewrite rules inspired by designers at Intel and an accompanying automated validation framework. A particular challenge in datapath design is to determine a productive order in which to apply transformations as this can be design dependent. ROVER resolves this problem by building upon the e-graph data structure, which compactly represents a design space of equivalent implementations. By applying rewrites to this data structure, ROVER generates a set of efficient and functionally equivalent design options. From the ROVER generated e-graph we select an efficient implementation. To accurately model the circuit area we develop a theoretical cost metric and then an integer linear programming model to extract the optimal implementation. To build trust in the generated design ROVER also produces a back-end verification certificate that can be checked using industrial tools. We apply ROVER to both Intel-provided and open-source benchmarks, and see up to a 63% reduction in circuit area. ROVER is also able to generate a customized library of distinct implementations from a given parameterizable RTL design, improving circuit area across the range of possible instantiations. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.03227 [pdf, other]

Soft GPGPU versus IP cores: Quantifying and Reducing the Performance Gap

Authors: Martin Langhammer, George A. Constantinides

Abstract: eGPU, a recently-reported soft GPGPU for FPGAs, has demonstrated very high clock frequencies (more than 750 MHz) and small footprint. This means that for the first time, commercial soft processors may be competitive for the kind of heavy numerical computations common in FPGA-based digital signal processing. In this paper we take a deep dive into the performance of the eGPU family on FFT computatio… ▽ More eGPU, a recently-reported soft GPGPU for FPGAs, has demonstrated very high clock frequencies (more than 750 MHz) and small footprint. This means that for the first time, commercial soft processors may be competitive for the kind of heavy numerical computations common in FPGA-based digital signal processing. In this paper we take a deep dive into the performance of the eGPU family on FFT computation, in order to quantify the performance gap between state-of-the-art soft processors and commercial IP cores specialized for this task. In the process, we propose two novel architectural features for the eGPU that improve the efficiency of the design by 50\% when executing the FFTs. The end-result is that our modified GPGPU takes only 3 times the performance-area product of a specialized IP core, yet as a programmable processor is able to execute arbitrary software-defined algorithms. Further comparison to Nvidia A100 GPGPUs demonstrates the superior efficiency of eGPU on FFTs of the size studied (256 to 4096-point). △ Less

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2403.00849 [pdf, other]

NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions

Authors: Marta Andronic, George A. Constantinides

Abstract: Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, some previous FPGA acceleration works have proposed mapping neurons with quantized inputs and outpu… ▽ More Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, some previous FPGA acceleration works have proposed mapping neurons with quantized inputs and outputs directly to lookup tables (LUTs) for hardware implementation. In these works, the boundaries of the neurons coincide with the boundaries of the LUTs. We propose relaxing these boundaries and mapping entire sub-networks to a single LUT. As the sub-networks are absorbed within the LUT, the NN topology and precision within a partition do not affect the size of the lookup tables generated. Therefore, we utilize fully connected layers with floating-point precision inside each partition, which benefit from being universal function approximators, but with rigid sparsity and quantization enforced between partitions, where the NN topology becomes exposed to the circuit topology. Although cheap to implement, this approach can lead to very deep NNs, and so to tackle challenges like vanishing gradients, we also introduce skip connections inside the partitions. The resulting methodology can be seen as training DNNs with a specific FPGA hardware-inspired sparsity pattern that allows them to be mapped to much shallower circuit-level networks, thereby significantly improving latency. We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, digit classification using MNIST. Our approach allows for greater function expressivity within the LUTs compared to existing work, leading to up to $4.3\times$ lower latency NNs for the same accuracy. △ Less

Submitted 3 July, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

arXiv:2402.02446 [pdf, other]

LQER: Low-Rank Quantization Error Reconstruction for LLMs

Authors: Cheng Zhang, Jianyi Cheng, George A. Constantinides, Yiren Zhao

Abstract: Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of quantization error towards a desirable distribution, which enables nea… ▽ More Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of quantization error towards a desirable distribution, which enables nearly-lossless W4A8 quantization on various LLMs and downstream tasks without the need for knowledge distillation, grid search, or gradient-base iterative optimization. Unlike existing methods, the computation pattern of LQER eliminates the need for specialized Scatter and Gather processes to collect high-precision weights from irregular memory locations. Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using 1.36$\times$ fewer hardware resources than the leading state-of-the-art method. We open-source our framework at https://github.com/ChengZhang-98/lqer △ Less

Submitted 30 May, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: Accepted at ICML2024

arXiv:2401.04261 [pdf, other]

A Statically and Dynamically Scalable Soft GPGPU

Authors: Martin Langhammer, George A. Constantinides

Abstract: Current soft processor architectures for FPGAs do not utilize the potential of the massive parallelism available. FPGAs now support many thousands of embedded floating point operators, and have similar computational densities to GPGPUs. Several soft GPGPU or SIMT processors have been published, but the reported large areas and modest Fmax makes their widespread use unlikely for commercial designs.… ▽ More Current soft processor architectures for FPGAs do not utilize the potential of the massive parallelism available. FPGAs now support many thousands of embedded floating point operators, and have similar computational densities to GPGPUs. Several soft GPGPU or SIMT processors have been published, but the reported large areas and modest Fmax makes their widespread use unlikely for commercial designs. In this paper we take an alternative approach, building the soft GPU microarchitecture around the FPGA resource mix available. We demonstrate a statically scalable soft GPGPU processor (where both parameters and feature set can be determined at configuration time) that always closes timing at the peak speed of the slowest embedded component in the FPGA (DSP or hard memory), with a completely unconstrained compile into a current Intel Agilex FPGA. We also show dynamic scalability, where a subset of the thread space can be specified on an instruction-by-instruction basis. For one example core type, we show a logic range -- depending on the configuration -- of 4k to 10k ALMs, along with 24 to 32 DSP Blocks, and 50 to 250 M20K memories. All of these instances close timing at 771 MHz, a performance level limited only by the DSP Blocks. We describe our methodology for reliably achieving this clock rate by matching the processor pipeline structure to the physical structure of the FPGA fabric. We also benchmark several algorithms across a range of data sizes, and compare to a commercial soft RISC processor. △ Less

Submitted 8 January, 2024; originally announced January 2024.

arXiv:2312.06004 [pdf, other]

Multiplier Optimization via E-Graph Rewriting

Authors: Andy Wanna, Samuel Coward, Theo Drane, George A. Constantinides, Miloš D. Ercegovac

Abstract: Multiplier circuits account for significant resource usage in datapath-dominated circuit designs, and RTL designers continue to build bespoke hand-crafted multiplication arrays for their particular application. The construction of an optimized multiplier presents trade-offs between pre-processing to generate a smaller array and array reduction. A data structure known as an e-graph has recently bee… ▽ More Multiplier circuits account for significant resource usage in datapath-dominated circuit designs, and RTL designers continue to build bespoke hand-crafted multiplication arrays for their particular application. The construction of an optimized multiplier presents trade-offs between pre-processing to generate a smaller array and array reduction. A data structure known as an e-graph has recently been applied to datapath optimization, where the e-graph's ability to efficiently explore trade-offs has been shown to be crucial. We propose an e-graph based rewriting framework to construct optimized multiplier circuits. Such a framework can express alternative multiplier representations and generate customized circuit designs. We demonstrate that the proposed tool, which we call OptiMult, can reduce the latency of a squarer by up to 46% and reduce the latency of a standard multiplier by up to 9% when compared against logic synthesis instantiated components. △ Less

Submitted 10 December, 2023; originally announced December 2023.

Comments: Preprint for work presented at the 2023 Asilomar Conference on Signals, Systems and Computers

arXiv:2310.05079 [pdf, other]

doi 10.18653/v1/2023.emnlp-main.617

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

Authors: Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao

Abstract: The inference of Large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has merged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address thi… ▽ More The inference of Large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has merged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a $19\times$ higher arithmetic density and $5\times$ memory density than the float32 baseline, surpassing the prior art 8-bit quantisation by $2.5\times$ in arithmetic density and $1.2\times$ in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced. △ Less

Submitted 21 October, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

Comments: Accepted by EMNLP2023

arXiv:2309.02334 [pdf, other]

doi 10.1109/ICFPT59805.2023.00012

PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference

Authors: Marta Andronic, George A. Constantinides

Abstract: Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by t… ▽ More Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset. △ Less

Submitted 6 November, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

Journal ref: 2023 International Conference on Field Programmable Technology (ICFPT), Yokohama, Japan, 2023, pp. 60-68

arXiv:2308.05170 [pdf, other]

FPGA Resource-aware Structured Pruning for Real-Time Neural Networks

Authors: Benjamin Ramhorst, Vladimir Loncar, George A. Constantinides

Abstract: Neural networks achieve state-of-the-art performance in image classification, speech recognition, scientific analysis and many more application areas. Due to the high computational complexity and memory footprint of neural networks, various compression techniques, such as pruning and quantization, have been proposed in literature. Pruning sparsifies a neural network, reducing the number of multipl… ▽ More Neural networks achieve state-of-the-art performance in image classification, speech recognition, scientific analysis and many more application areas. Due to the high computational complexity and memory footprint of neural networks, various compression techniques, such as pruning and quantization, have been proposed in literature. Pruning sparsifies a neural network, reducing the number of multiplications and memory. However, pruning often fails to capture properties of the underlying hardware, causing unstructured sparsity and load-balance inefficiency, thus bottlenecking resource improvements. We propose a hardware-centric formulation of pruning, by formulating it as a knapsack problem with resource-aware tensor structures. Evaluated on a range of tasks, including sub-microsecond particle classification at CERN's Large Hadron Collider and fast image classification, the proposed method achieves reductions ranging between 55% and 92% in the DSP utilization and up to 81% in BRAM utilization. △ Less

Submitted 12 December, 2023; v1 submitted 9 August, 2023; originally announced August 2023.

arXiv:2307.15517 [pdf, other]

A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats

Authors: Jianyi Cheng, Cheng Zhang, Zhewen Yu, Christos-Savvas Bouganis, George A. Constantinides, Yiren Zhao

Abstract: Model quantization represents both parameters (weights) and intermediate values (activations) in a more compact format, thereby directly reducing both computational and memory cost in hardware. The quantization of recent large language models (LLMs) faces challenges to achieve competitive memory density compared to other models such as convolutional neural networks, since values in LLMs require la… ▽ More Model quantization represents both parameters (weights) and intermediate values (activations) in a more compact format, thereby directly reducing both computational and memory cost in hardware. The quantization of recent large language models (LLMs) faces challenges to achieve competitive memory density compared to other models such as convolutional neural networks, since values in LLMs require larger dynamic ranges. Current hardware can expedite computation for LLMs using compact numerical formats such as low-bitwidth integers or floating-point numbers. Each has advantages: integer operations simplify circuit design, whereas floating-point calculations can enhance accuracy when a wider dynamic range is required. In this work, we seek an efficient data format that combines the best of both worlds: Microscaling (MX) formats. MX formats are efficient data formats that achieve both large dynamic ranges and high memory density. In this paper, we propose a compiler named MASE for exploring mixed-precision MX formats on dataflow hardware accelerators for LLM inference. Our main contributions are twofold. First, we propose a novel orchestration abstraction to explore both software and hardware optimizations with new data formats. Second, MASE achieves LLM inference at an average precision of 4-bits, with minimal to no accuracy degradation. To our knowledge, MASE represents the first effort to harness fine-grain multi-precision MX formats in the design of LLM hardware accelerators. Over a range of LLMs and datasets, MASE achieves an average improvement of 24% in $Δ$ accuracy with an overhead of only 3% in energy efficiency compared to designs using 8-bit fixed-point numbers. △ Less

Submitted 19 April, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

arXiv:2304.08400 [pdf, other]

ATHEENA: A Toolflow for Hardware Early-Exit Network Automation

Authors: Benjamin Biggs, Christos-Savvas Bouganis, George A. Constantinides

Abstract: The continued need for improvements in accuracy, throughput, and efficiency of Deep Neural Networks has resulted in a multitude of methods that make the most of custom architectures on FPGAs. These include the creation of hand-crafted networks and the use of quantization and pruning to reduce extraneous network parameters. However, with the potential of static solutions already well exploited, we… ▽ More The continued need for improvements in accuracy, throughput, and efficiency of Deep Neural Networks has resulted in a multitude of methods that make the most of custom architectures on FPGAs. These include the creation of hand-crafted networks and the use of quantization and pruning to reduce extraneous network parameters. However, with the potential of static solutions already well exploited, we propose to shift the focus to using the varying difficulty of individual data samples to further improve efficiency and reduce average compute for classification. Input-dependent computation allows for the network to make runtime decisions to finish a task early if the result meets a confidence threshold. Early-Exit network architectures have become an increasingly popular way to implement such behaviour in software. We create: A Toolflow for Hardware Early-Exit Network Automation (ATHEENA), an automated FPGA toolflow that leverages the probability of samples exiting early from such networks to scale the resources allocated to different sections of the network. The toolflow uses the data-flow model of fpgaConvNet, extended to support Early-Exit networks as well as Design Space Exploration to optimize the generated streaming architecture hardware with the goal of increasing throughput/reducing area while maintaining accuracy. Experimental results on three different networks demonstrate a throughput increase of $2.00\times$ to $2.78\times$ compared to an optimized baseline network implementation with no early exits. Additionally, the toolflow can achieve a throughput matching the same baseline with as low as $46\%$ of the resources the baseline requires. △ Less

Submitted 17 April, 2023; originally announced April 2023.

arXiv:2303.01839 [pdf, other]

Automating Constraint-Aware Datapath Optimization using E-Graphs

Authors: Samuel Coward, George A. Constantinides, Theo Drane

Abstract: Numerical hardware design requires aggressive optimization, where designers exploit branch constraints, creating optimization opportunities that are valid only on a sub-domain of input space. We developed an RTL optimization tool that automatically learns the consequences of conditional branches and exploits that knowledge to enable deep optimization. The tool deploys custom built program analysis… ▽ More Numerical hardware design requires aggressive optimization, where designers exploit branch constraints, creating optimization opportunities that are valid only on a sub-domain of input space. We developed an RTL optimization tool that automatically learns the consequences of conditional branches and exploits that knowledge to enable deep optimization. The tool deploys custom built program analysis based on abstract interpretation theory, which when combined with a data-structure known as an e-graph simplifies complex reasoning about program properties. Our tool fully-automatically discovers known floating-point architectures from the computer arithmetic literature and out-performs baseline EDA tools, generating up to 33% faster and 41% smaller circuits. △ Less

Submitted 3 March, 2023; originally announced March 2023.

arXiv:2205.14989 [pdf, other]

Combining E-Graphs with Abstract Interpretation

Authors: Samuel Coward, George A. Constantinides, Theo Drane

Abstract: E-graphs are a data structure that compactly represents equivalent expressions. They are constructed via the repeated application of rewrite rules. Often in practical applications, conditional rewrite rules are crucial, but their application requires the detection - at the time the e-graph is being built - that a condition is valid in the domain of application. Detecting condition validity amounts… ▽ More E-graphs are a data structure that compactly represents equivalent expressions. They are constructed via the repeated application of rewrite rules. Often in practical applications, conditional rewrite rules are crucial, but their application requires the detection - at the time the e-graph is being built - that a condition is valid in the domain of application. Detecting condition validity amounts to proving a property of the program. Abstract interpretation is a general method to learn such properties, traditionally used in static analysis tools. We demonstrate that abstract interpretation and e-graph analysis naturally reinforce each other through a tight integration because (i) the e-graph clustering of equivalent expressions induces natural precision refinement of abstractions and (ii) precise abstractions allow the application of deeper rewrite rules (and hence potentially even greater precision). We develop the theory behind this intuition and present an exemplar interval arithmetic implementation, which we apply to the FPBench suite. △ Less

Submitted 15 August, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

arXiv:2204.11478 [pdf, other]

Automatic Datapath Optimization using E-Graphs

Authors: Samuel Coward, George A. Constantinides, Theo Drane

Abstract: Manual optimization of Register Transfer Level (RTL) datapath is commonplace in industry but holds back development as it can be very time consuming. We utilize the fact that a complex transformation of one RTL into another equivalent RTL can be broken down into a sequence of smaller, localized transformations. By representing RTL as a graph and deploying modern graph rewriting techniques we can a… ▽ More Manual optimization of Register Transfer Level (RTL) datapath is commonplace in industry but holds back development as it can be very time consuming. We utilize the fact that a complex transformation of one RTL into another equivalent RTL can be broken down into a sequence of smaller, localized transformations. By representing RTL as a graph and deploying modern graph rewriting techniques we can automate the circuit design space exploration, allowing us to discover functionally equivalent but optimized architectures. We demonstrate that modern rewriting frameworks can adequately capture a wide variety of complex optimizations performed by human designers on bit-vector manipulating code, including significant error-prone subtleties regarding the validity of transformations under complex interactions of bitwidths. The proposed automated optimization approach is able to reproduce the results of typical industrial manual optimization, resulting in a reduction in circuit area by up to 71%. Not only does our tool discover optimized RTL, but also correctly identifies that the optimal architecture to implement a given arithmetic expression can depend on the width of the operands, thus producing a library of optimized designs rather than the single design point typically generated by manual optimization. In addition, we demonstrate that prior academic work on maximally exploiting carry-save representation and on multiple constant multiplication are both generalized and extended, falling out as special cases of this paper. △ Less

Submitted 26 July, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

arXiv:2203.09191 [pdf, other]

Abstract Interpretation on E-Graphs

Authors: Samuel Coward, George A. Constantinides, Theo Drane

Abstract: Recent e-graph applications have typically considered concrete semantics of expressions, where the notion of equivalence stems from concrete interpretation of expressions. However, equivalences that hold over one interpretation may not hold in an alternative interpretation. Such an observation can be exploited. We consider the application of abstract interpretation to e-graphs, and show that withi… ▽ More Recent e-graph applications have typically considered concrete semantics of expressions, where the notion of equivalence stems from concrete interpretation of expressions. However, equivalences that hold over one interpretation may not hold in an alternative interpretation. Such an observation can be exploited. We consider the application of abstract interpretation to e-graphs, and show that within an e-graph, the lattice meet operation associated with the abstract domain has a natural interpretation for an e-class, leading to improved precision in over-approximation. In this extended abstract, we use Interval Arithmetic (IA) to illustrate this point. △ Less

Submitted 17 March, 2022; originally announced March 2022.

arXiv:2201.11522 [pdf, other]

High-level Synthesis using the Julia Language

Authors: Benjamin Biggs, Ian McInerney, Eric C. Kerrigan, George A. Constantinides

Abstract: The growing proliferation of FPGAs and High-level Synthesis (HLS) tools has led to a large interest in designing hardware accelerators for complex operations and algorithms. However, existing HLS toolflows typically require a significant amount of user knowledge or training to be effective in both industrial and research applications. In this paper, we propose using the Julia language as the basis… ▽ More The growing proliferation of FPGAs and High-level Synthesis (HLS) tools has led to a large interest in designing hardware accelerators for complex operations and algorithms. However, existing HLS toolflows typically require a significant amount of user knowledge or training to be effective in both industrial and research applications. In this paper, we propose using the Julia language as the basis for an HLS tool. The Julia HLS tool aims to decrease the barrier to entry for hardware acceleration by taking advantage of the readability of the Julia language and by allowing the use of the existing large library of standard mathematical functions written in Julia. We present a prototype Julia HLS tool, written in Julia, that transforms Julia code to VHDL. We highlight how features of Julia and its compiler simplified the creation of this tool, and we discuss potential directions for future work. △ Less

Submitted 17 February, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

Comments: Presented at the 2nd Workshop on Languages, Tools, and Techniques for Accelerator Design (LATTE'22)

arXiv:2112.06887 [pdf, other]

doi 10.1002/advs.202105784

Nonideality-Aware Training for Accurate and Robust Low-Power Memristive Neural Networks

Authors: Dovydas Joksas, Erwei Wang, Nikolaos Barmpatsalos, Wing H. Ng, Anthony J. Kenyon, George A. Constantinides, Adnan Mehonic

Abstract: Recent years have seen a rapid rise of artificial neural networks being employed in a number of cognitive tasks. The ever-increasing computing requirements of these structures have contributed to a desire for novel technologies and paradigms, including memristor-based hardware accelerators. Solutions based on memristive crossbars and analog data processing promise to improve the overall energy eff… ▽ More Recent years have seen a rapid rise of artificial neural networks being employed in a number of cognitive tasks. The ever-increasing computing requirements of these structures have contributed to a desire for novel technologies and paradigms, including memristor-based hardware accelerators. Solutions based on memristive crossbars and analog data processing promise to improve the overall energy efficiency. However, memristor nonidealities can lead to the degradation of neural network accuracy, while the attempts to mitigate these negative effects often introduce design trade-offs, such as those between power and reliability. In this work, we design nonideality-aware training of memristor-based neural networks capable of dealing with the most common device nonidealities. We demonstrate the feasibility of using high-resistance devices that exhibit high $I$-$V$ nonlinearity -- by analyzing experimental data and employing nonideality-aware training, we estimate that the energy efficiency of memristive vector-matrix multipliers is improved by three orders of magnitude ($0.715\ \mathrm{TOPs}^{-1}\mathrm{W}^{-1}$ to $381\ \mathrm{TOPs}^{-1}\mathrm{W}^{-1}$) while maintaining similar accuracy. We show that associating the parameters of neural networks with individual memristors allows to bias these devices towards less conductive states through regularization of the corresponding optimization problem, while modifying the validation procedure leads to more reliable estimates of performance. We demonstrate the universality and robustness of our approach when dealing with a wide range of nonidealities. △ Less

Submitted 5 May, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: 32 pages, 20 figures, 4 tables

Journal ref: Adv. Sci. 2022, 2105784

arXiv:2112.02346 [pdf, other]

doi 10.1145/3490422.3502360

Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network Inference

Authors: Erwei Wang, James J. Davis, Georgios-Ilias Stavrou, Peter Y. K. Cheung, George A. Constantinides, Mohamed S. Abdelfattah

Abstract: FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this paper, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency… ▽ More FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this paper, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency designs than via the direct use of off-the-shelf, hand-designed networks. Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K. Choosing appropriate K a priori is challenging, and doing so at even high granularity, e.g. per layer, is a time-consuming and error-prone process that leaves FPGAs' spatial flexibility underexploited. Furthermore, prior works see LUT inputs connected randomly, which does not guarantee a good choice of network topology. To address these issues, we propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference. By removing LUT inputs determined to be of low importance, our method increases the efficiency of the resultant accelerators. Our GPU-friendly solution to LUT input removal is capable of processing large topologies during their training with negligible slowdown. With logic shrinkage, we better the area and energy efficiency of the best-performing LUTNet implementation of the CNV network classifying CIFAR-10 by 1.54x and 1.31x, respectively, while matching its accuracy. This implementation also reaches 2.71x the area efficiency of an equally accurate, heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employment of logic shrinkage results in a post-synthesis area reduction of 2.67x vs LUTNet, allowing for implementation that was previously impossible on today's largest FPGAs. △ Less

Submitted 2 January, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

Comments: Accepted manuscript uploaded 04/12/21. DOA 22/11/21

arXiv:2102.04270 [pdf, other]

Enabling Binary Neural Network Training on the Edge

Authors: Erwei Wang, James J. Davis, Daniele Moro, Piotr Zielinski, Jia Jie Lim, Claudionor Coelho, Satrajit Chatterjee, Peter Y. K. Cheung, George A. Constantinides

Abstract: The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the co… ▽ More The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the concurrent storage of high-precision activations for all layers, generally making learning on memory-constrained devices infeasible. In this article, we demonstrate that the backward propagation operations needed for binary neural network training are strongly robust to quantization, thereby making on-the-edge learning with modern models a practical proposition. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions while inducing little to no accuracy loss vs Courbariaux & Bengio's standard approach. These decreases are primarily enabled through the retention of activations exclusively in binary format. Against the latter algorithm, our drop-in replacement sees memory requirement reductions of 3--5$\times$, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also demonstrate from-scratch ImageNet training of binarized ResNet-18, achieving a 3.78$\times$ memory reduction. Our work is open-source, and includes the Raspberry Pi-targeted prototype we used to verify our modeled memory decreases and capture the associated energy drops. Such savings will allow for unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency, and safeguarding end-user privacy. △ Less

Submitted 24 September, 2023; v1 submitted 8 February, 2021; originally announced February 2021.

arXiv:1912.00867 [pdf, other]

A Probabilistic Approach to Floating-Point Arithmetic

Authors: Fredrik Dahlqvist, Rocco Salvia, George A Constantinides

Abstract: Finite-precision floating point arithmetic unavoidably introduces rounding errors which are traditionally bounded using a worst-case analysis. However, worst-case analysis might be overly conservative because worst-case errors can be extremely rare events in practice. Here we develop a probabilistic model of rounding errors with which it becomes possible to estimate the likelihood that the roundin… ▽ More Finite-precision floating point arithmetic unavoidably introduces rounding errors which are traditionally bounded using a worst-case analysis. However, worst-case analysis might be overly conservative because worst-case errors can be extremely rare events in practice. Here we develop a probabilistic model of rounding errors with which it becomes possible to estimate the likelihood that the rounding error of an algorithm lies within a given interval. Given an input distribution, we show how to compute the distribution of rounding errors. We do this exactly for low precision arithmetic, for high precision arithmetic we derive a simple approximation. The model is then entirely compositional: given a numerical program written in a simple imperative programming language we can recursively compute the distribution of rounding errors at each step of the computation and propagate it through each program instruction. This is done by applying a formalism originally developed by Kozen to formalize the semantics of probabilistic programs. We then discuss an implementation of the model and use it to perform probabilistic range analyses on some benchmarks. △ Less

Submitted 10 December, 2019; v1 submitted 2 December, 2019; originally announced December 2019.

Comments: 9 pages, 6 figures

arXiv:1910.12625 [pdf, other]

LUTNet: Learning FPGA Configurations for Highly Efficient Neural Network Inference

Authors: Erwei Wang, James J. Davis, Peter Y. K. Cheung, George A. Constantinides

Abstract: Research has shown that deep neural networks contain significant redundancy, and thus that high classification accuracy can be achieved even when weights and activations are quantized down to binary values. Network binarization on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, i… ▽ More Research has shown that deep neural networks contain significant redundancy, and thus that high classification accuracy can be achieved even when weights and activations are quantized down to binary values. Network binarization on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We describe the realization of both unrolled and tiled LUTNet architectures, with the latter facilitating smaller, less power-hungry deployment over the former while sacrificing area and energy efficiency along with throughput. For both varieties, we demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarized neural network implementation, we achieve up to twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable. △ Less

Submitted 2 March, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1904.00938. Accepted manuscript uploaded 02/03/20. DOA 01/03/20

arXiv:1910.00271 [pdf, other]

ARCHITECT: Arbitrary-precision Hardware with Digit Elision for Efficient Iterative Compute

Authors: He Li, James J. Davis, John Wickerson, George A. Constantinides

Abstract: Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed- or floating-point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the… ▽ More Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed- or floating-point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the algorithm for enough iterations or because the arithmetic in some iterations was insufficiently precise? There is no easy way to answer this question, so users will often over-budget precision in the hope that the answer will always be to run for a few more iterations. We propose a fundamentally new approach: with the appropriate arithmetic able to generate results from most-significant digit first, we show that fixed compute-area hardware can be used to calculate an arbitrary number of algorithmic iterations to arbitrary precision, with both precision and approximant index increasing in lockstep. Consequently, datapaths constructed following our principles demonstrate efficiency over their traditional arithmetic equivalents where the latter's precisions are either under- or over-budgeted for the computation of a result to a particular accuracy. Use of most-significant digit-first arithmetic additionally allows us to declare certain digits to be stable at runtime, avoiding their recalculation in subsequent iterations and thereby increasing performance and decreasing memory footprints. Versus arbitrary-precision iterative solvers without the optimisations we detail herein, we achieve up-to 16$\times$ performance speedups and 1.9x memory savings for the evaluated benchmarks. △ Less

Submitted 1 October, 2019; originally announced October 2019.

arXiv:1905.02438 [pdf, other]

doi 10.1098/rsta.2019.0051

Rethinking Arithmetic for Deep Neural Networks

Authors: George A. Constantinides

Abstract: We consider efficiency in the implementation of deep neural networks. Hardware accelerators are gaining interest as machine learning becomes one of the drivers of high-performance computing. In these accelerators, the directed graph describing a neural network can be implemented as a directed graph describing a Boolean circuit. We make this observation precise, leading naturally to an understandin… ▽ More We consider efficiency in the implementation of deep neural networks. Hardware accelerators are gaining interest as machine learning becomes one of the drivers of high-performance computing. In these accelerators, the directed graph describing a neural network can be implemented as a directed graph describing a Boolean circuit. We make this observation precise, leading naturally to an understanding of practical neural networks as discrete functions, and show that so-called binarised neural networks are functionally complete. In general, our results suggest that it is valuable to consider Boolean circuits as neural networks, leading to the question of which circuit topologies are promising. We argue that continuity is central to generalisation in learning, explore the interaction between data coding, network topology, and node functionality for continuity, and pose some open questions for future research. As a first step to bridging the gap between continuous and Boolean views of neural network accelerators, we present some recent results from our work on LUTNet, a novel Field-Programmable Gate Array inference approach. Finally, we conclude with additional possible fruitful avenues for research bridging the continuous and discrete views of neural networks. △ Less

Submitted 17 September, 2019; v1 submitted 7 May, 2019; originally announced May 2019.

arXiv:1904.00938 [pdf, other]

LUTNet: Rethinking Inference in FPGA Soft Logic

Authors: Erwei Wang, James J. Davis, Peter Y. K. Cheung, George A. Constantinides

Abstract: Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is c… ▽ More Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarised neural network implementation, we achieve twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable. △ Less

Submitted 1 April, 2019; originally announced April 2019.

Comments: Accepted manuscript uploaded 01/04/19. DOA 03/03/19

arXiv:1901.06955 [pdf, other]

doi 10.1145/3309551

Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going

Authors: Erwei Wang, James J. Davis, Ruizhe Zhao, Ho-Cheung Ng, Xinyu Niu, Wayne Luk, Peter Y. K. Cheung, George A. Constantinides

Abstract: Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms… ▽ More Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field. △ Less

Submitted 8 July, 2019; v1 submitted 21 January, 2019; originally announced January 2019.

Comments: Accepted manuscript uploaded 21/01/19. DOA 15/01/19

Journal ref: ACM Comput. Surv. 52, 2, Article 40 (May 2019), 39 pages

Showing 1–28 of 28 results for author: Constantinides, G A