Search | arXiv e-print repository

doi 10.1109/HPCA53966.2022.00063

Adaptable Register File Organization for Vector Processors

Authors: Cristóbal Ramírez Lazo, Enrico Reggiani, Carlos Rojas Morales, Roger Figueras Bagué, Luis Alfonso Villa Vargas, Marco Antonio Ramírez Salinas, Mateo Valero Cortés, Osman Sabri Unsal, Adrián Cristal

Abstract: Modern scientific applications are getting more diverse, and the vector lengths in those applications vary widely. Contemporary Vector Processors (VPs) are designed either for short vector lengths, e.g., Fujitsu A64FX with 512-bit ARM SVE vector support, or long vectors, e.g., NEC Aurora Tsubasa with 16Kbits Maximum Vector Length (MVL). Unfortunately, both approaches have drawbacks. On the one han… ▽ More Modern scientific applications are getting more diverse, and the vector lengths in those applications vary widely. Contemporary Vector Processors (VPs) are designed either for short vector lengths, e.g., Fujitsu A64FX with 512-bit ARM SVE vector support, or long vectors, e.g., NEC Aurora Tsubasa with 16Kbits Maximum Vector Length (MVL). Unfortunately, both approaches have drawbacks. On the one hand, short vector length VP designs struggle to provide high efficiency for applications featuring long vectors with high Data Level Parallelism (DLP). On the other hand, long vector VP designs waste resources and underutilize the Vector Register File (VRF) when executing low DLP applications with short vector lengths. Therefore, those long vector VP implementations are limited to a specialized subset of applications, where relatively high DLP must be present to achieve excellent performance with high efficiency. To overcome these limitations, we propose an Adaptable Vector Architecture (AVA) that leads to having the best of both worlds. AVA is designed for short vectors (MVL=16 elements) and is thus area and energy-efficient. However, AVA has the functionality to reconfigure the MVL, thereby allowing to exploit the benefits of having a longer vector (up to 128 elements) microarchitecture when abundant DLP is present. We model AVA on the gem5 simulator and evaluate the performance with six applications taken from the RiVEC Benchmark Suite. To obtain area and power consumption metrics, we model AVA on McPAT for 22nm technology. Our results show that by reconfiguring our small VRF (8KB) plus our novel issue queue scheme, AVA yields a 2X speedup over the default configuration for short vectors. Additionally, AVA shows competitive performance when compared to a long vector VP, while saving 50% of area. △ Less

Submitted 29 May, 2022; v1 submitted 9 November, 2021; originally announced November 2021.

Comments: 28th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2022)

arXiv:2111.01949 [pdf]

doi 10.1145/3422667

A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures

Authors: Cristóbal Ramírez Lazo, César Alejandro Hernández, Oscar Palomar, Osman Sabri Unsal, Marco Antonio Ramírez, Adrían Cristal

Abstract: Vector architectures lack tools for research. Consider the gem5 simulator, which is possibly the leading platform for computer-system architecture research. Unfortunately, gem5 does not have an available distribution that includes a flexible and customizable vector architecture model. In consequence, researchers have to develop their own simulation platform to test their ideas, which consume much… ▽ More Vector architectures lack tools for research. Consider the gem5 simulator, which is possibly the leading platform for computer-system architecture research. Unfortunately, gem5 does not have an available distribution that includes a flexible and customizable vector architecture model. In consequence, researchers have to develop their own simulation platform to test their ideas, which consume much research time. However, once the base simulator platform is developed, another question is the following: Which applications should be tested to perform the experiments? The lack of Vectorized Benchmark Suites is another limitation. To face these problems, this work presents a set of tools for designing and evaluating vector architectures. First, the gem5 simulator was extended to support the execution of RISC-V Vector instructions by adding a parameterizable Vector Architecture model for designers to evaluate different approaches according to the target they pursue. Second, a novel Vectorized Benchmark Suite is presented: a collection composed of seven data-parallel applications from different domains that can be classified according to the modules that are stressed in the vector architecture. Finally, a study of the Vectorized Benchmark Suite executing on the gem5-based Vector Architecture model is highlighted. This suite is the first in its category that covers the different possible usage scenarios that may occur within different vector architecture designs such as embedded systems, mainly focused on short vectors, or High-Performance-Computing (HPC), usually designed for large vectors. △ Less

Submitted 29 October, 2021; originally announced November 2021.

Comments: ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 4, December 2020, Article No.38

arXiv:2005.04737 [pdf, other]

Power and Accuracy of Multi-Layer Perceptrons (MLPs) under Reduced-voltage FPGA BRAMs Operation

Authors: Behzad Salami, Osman Unsal, Adrian Cristal

Abstract: In this paper, we exploit the aggressive supply voltage underscaling technique in Block RAMs (BRAMs) of Field Programmable Gate Arrays (FPGAs) to improve the energy efficiency of Multi-Layer Perceptrons (MLPs). Additionally, we evaluate and improve the resilience of this accelerator. Through experiments on several representative FPGA fabrics, we observe that until a minimum safe voltage level, i.e… ▽ More In this paper, we exploit the aggressive supply voltage underscaling technique in Block RAMs (BRAMs) of Field Programmable Gate Arrays (FPGAs) to improve the energy efficiency of Multi-Layer Perceptrons (MLPs). Additionally, we evaluate and improve the resilience of this accelerator. Through experiments on several representative FPGA fabrics, we observe that until a minimum safe voltage level, i.e., Vmin the MLP accuracy is not affected. This safe region involves a large voltage guardband. Also, it involves a narrower voltage region where faults start to appear in memories due to the increased circuit delay, but these faults are masked by MLP, and thus, its accuracy is not affected. However, further undervolting causes significant accuracy loss as a result of the fast-increasing high fault rates. Based on the characterization of these undervolting faults, we propose fault mitigation techniques that can effectively improve the resilience behavior of such accelerator. Our evaluation is based on four FPGA platforms. On average, we achieve >90% energy saving with a negligible accuracy loss of up to 0.1%. △ Less

Submitted 10 May, 2020; originally announced May 2020.

arXiv:2001.00053 [pdf, other]

On the Resilience of Deep Learning for Reduced-voltage FPGAs

Authors: Kamyar Givaki, Behzad Salami, Reza Hojabr, S. M. Reza Tayaranian, Ahmad Khonsari, Dara Rahmati, Saeid Gorgin, Adrian Cristal, Osman S. Unsal

Abstract: Deep Neural Networks (DNNs) are inherently computation-intensive and also power-hungry. Hardware accelerators such as Field Programmable Gate Arrays (FPGAs) are a promising solution that can satisfy these requirements for both embedded and High-Performance Computing (HPC) systems. In FPGAs, as well as CPUs and GPUs, aggressive voltage scaling below the nominal level is an effective technique for p… ▽ More Deep Neural Networks (DNNs) are inherently computation-intensive and also power-hungry. Hardware accelerators such as Field Programmable Gate Arrays (FPGAs) are a promising solution that can satisfy these requirements for both embedded and High-Performance Computing (HPC) systems. In FPGAs, as well as CPUs and GPUs, aggressive voltage scaling below the nominal level is an effective technique for power dissipation minimization. Unfortunately, bit-flip faults start to appear as the voltage is scaled down closer to the transistor threshold due to timing issues, thus creating a resilience issue. This paper experimentally evaluates the resilience of the training phase of DNNs in the presence of voltage underscaling related faults of FPGAs, especially in on-chip memories. Toward this goal, we have experimentally evaluated the resilience of LeNet-5 and also a specially designed network for CIFAR-10 dataset with different activation functions of Rectified Linear Unit (Relu) and Hyperbolic Tangent (Tanh). We have found that modern FPGAs are robust enough in extremely low-voltage levels and that low-voltage related faults can be automatically masked within the training iterations, so there is no need for costly software- or hardware-oriented fault mitigation techniques like ECC. Approximately 10% more training iterations are needed to fill the gap in the accuracy. This observation is the result of the relatively low rate of undervolting faults, i.e., <0.1\%, measured on real FPGA fabrics. We have also increased the fault rate significantly for the LeNet-5 network by randomly generated fault injection campaigns and observed that the training accuracy starts to degrade. When the fault rate increases, the network with Tanh activation function outperforms the one with Relu in terms of accuracy, e.g., when the fault rate is 30% the accuracy difference is 4.92%. △ Less

Submitted 26 December, 2019; originally announced January 2020.

arXiv:1912.01563 [pdf, other]

LEGaTO: Low-Energy, Secure, and Resilient Toolset for Heterogeneous Computing

Authors: B. Salami, K. Parasyris, A. Cristal, O. Unsal, X. Martorell, P. Carpenter, R. De La Cruz, L. Bautista, D. Jimenez, C. Alvarez, S. Nabavi, S. Madonar, M. Pericas, P. Trancoso, M. Abduljabbar, J. Chen, P. N. Soomro, M Manivannan, M. Berge, S. Krupop, F. Klawonn, Al Mekhlafi, S. May, T. Becker, G. Gaydadjiev , et al. (20 additional authors not shown)

Abstract: The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced with the security and resilience challenges. LEGaTO is an ongoing three-year EU H2020 project started in… ▽ More The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced with the security and resilience challenges. LEGaTO is an ongoing three-year EU H2020 project started in December 2017. △ Less

Submitted 1 December, 2019; originally announced December 2019.

Comments: 6 pages, 9 figures

arXiv:1912.01556 [pdf]

A Novel FPGA-Based High Throughput Accelerator For Binary Search Trees

Authors: Oyku Melikoglu, Oguz Ergin, Behzad Salami, Julian Pavon, Osman Unsal, Adrian Cristal

Abstract: This paper presents a deeply pipelined and massively parallel Binary Search Tree (BST) accelerator for Field Programmable Gate Arrays (FPGAs). Our design relies on the extremely parallel on-chip memory, or Block RAMs (BRAMs) architecture of FPGAs. To achieve significant throughput for the search operation on BST, we present several novel mechanisms including tree duplication as well as horizontal,… ▽ More This paper presents a deeply pipelined and massively parallel Binary Search Tree (BST) accelerator for Field Programmable Gate Arrays (FPGAs). Our design relies on the extremely parallel on-chip memory, or Block RAMs (BRAMs) architecture of FPGAs. To achieve significant throughput for the search operation on BST, we present several novel mechanisms including tree duplication as well as horizontal, duplicated, and hybrid (horizontal-vertical) tree partitioning. Also, we present efficient techniques to decrease the stalling rates that can occur during the parallel tree search. By combining these techniques and implementations on Xilinx Virtex-7 VC709 platform, we achieve up to 8X throughput improvement gain in comparison to the baseline implementation, i.e., a fully-pipelined FPGA-based accelerator. △ Less

Submitted 1 December, 2019; originally announced December 2019.

Comments: 8 pages, 9 figures

arXiv:1806.09679 [pdf, other]

On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation

Authors: Behzad Salami, Osman Unsal, Adrian Cristal

Abstract: Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology node scaling below 10nm, hardware accelerators become more susceptibl… ▽ More Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology node scaling below 10nm, hardware accelerators become more susceptible to faults, which in turn can impact the NN accuracy. In this paper, we study the resilience aspects of Register-Transfer Level (RTL) model of NN accelerators, in particular, fault characterization and mitigation. By following a High-Level Synthesis (HLS) approach, first, we characterize the vulnerability of various components of RTL NN. We observed that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate), NN layers, and NN activation functions, and ii) architectural-level specifications, i.e., data representation model and the parallelism degree of the underlying accelerator. Second, motivated by characterization results, we present a low-overhead fault mitigation technique that can efficiently correct bit flips, by 47.3% better than state-of-the-art methods. △ Less

Submitted 14 June, 2018; originally announced June 2018.

Comments: 8 pages, 6 figures

MSC Class: 68T01

arXiv:1804.05267 [pdf, other]

Low-Precision Floating-Point Schemes for Neural Network Training

Authors: Marc Ortiz, Adrián Cristal, Eduard Ayguadé, Marc Casas

Abstract: The use of low-precision fixed-point arithmetic along with stochastic rounding has been proposed as a promising alternative to the commonly used 32-bit floating point arithmetic to enhance training neural networks training in terms of performance and energy efficiency. In the first part of this paper, the behaviour of the 12-bit fixed-point arithmetic when training a convolutional neural network w… ▽ More The use of low-precision fixed-point arithmetic along with stochastic rounding has been proposed as a promising alternative to the commonly used 32-bit floating point arithmetic to enhance training neural networks training in terms of performance and energy efficiency. In the first part of this paper, the behaviour of the 12-bit fixed-point arithmetic when training a convolutional neural network with the CIFAR-10 dataset is analysed, showing that such arithmetic is not the most appropriate for the training phase. After that, the paper presents and evaluates, under the same conditions, alternative low-precision arithmetics, starting with the 12-bit floating-point arithmetic. These two representations are then leveraged using local scaling in order to increase accuracy and get closer to the baseline 32-bit floating-point arithmetic. Finally, the paper introduces a simplified model in which both the outputs and the gradients of the neural networks are constrained to power-of-two values, just using 7 bits for their representation. The evaluation demonstrates a minimal loss in accuracy for the proposed Power-of-Two neural network, avoiding the use of multiplications and divisions and thereby, significantly reducing the training time as well as the energy consumption and memory requirements during the training and inference phases. △ Less

Submitted 14 April, 2018; originally announced April 2018.

Comments: 16 pages, 9 figures and 4 tables

ACM Class: I.2.6; I.5

Showing 1–8 of 8 results for author: Cristal, A