-
Vitamin-V: Expanding Open-Source RISC-V Cloud Environments
Authors:
Ramon Canal,
Stefano Di Carlo,
Dimitris Gizopoulos,
Alberto Scionti,
Francesco Lubrano,
Josep-Lluís Berral,
Aaron Call,
Diego Marron,
Konstantinos Nikas,
Dionisios Pnevmatikatos,
Daniel Raho,
Alvise Rigo,
Yannis Papaefstathiou,
José María Arnau,
Angelos Arelakis
Abstract:
Among the key contributions of Vitamin-V (2023-2025 Horizon Europe project), we develop a complete RISC-V open-source software stack for cloud services with comparable performance to the cloud-dominant x86 counterpart. In this paper, we detail the software suites and applications ported plus the three cloud setups under evaluation.
Among the key contributions of Vitamin-V (2023-2025 Horizon Europe project), we develop a complete RISC-V open-source software stack for cloud services with comparable performance to the cloud-dominant x86 counterpart. In this paper, we detail the software suites and applications ported plus the three cloud setups under evaluation.
△ Less
Submitted 12 June, 2024;
originally announced July 2024.
-
A Lightweight, Compiler-Assisted Register File Cache for GPGPU
Authors:
Mojtaba Abaie Shoushtary,
Jose Maria Arnau,
Jordi Tubella Murgadas,
Antonio Gonzalez
Abstract:
Modern GPUs require an enormous register file (RF) to store the context of thousands of active threads. It consumes considerable energy and contains multiple large banks to provide enough throughput. Thus, a RF caching mechanism can significantly improve the performance and energy consumption of the GPUs by avoiding reads from the large banks that consume significant energy and may cause port conf…
▽ More
Modern GPUs require an enormous register file (RF) to store the context of thousands of active threads. It consumes considerable energy and contains multiple large banks to provide enough throughput. Thus, a RF caching mechanism can significantly improve the performance and energy consumption of the GPUs by avoiding reads from the large banks that consume significant energy and may cause port conflicts.
This paper introduces an energy-efficient RF caching mechanism called Malekeh that repurposes an existing component in GPUs' RF to operate as a cache in addition to its original functionality. In this way, Malekeh minimizes the overhead of adding a RF cache to GPUs. Besides, Malekeh leverages an issue scheduling policy that utilizes the reuse distance of the values in the RF cache and is controlled by a dynamic algorithm. The goal is to adapt the issue policy to the runtime program characteristics to maximize the GPU's performance and the hit ratio of the RF cache. The reuse distance is approximated by the compiler using profiling and is used at run time by the proposed caching scheme. We show that Malekeh reduces the number of reads to the RF banks by 46.4% and the dynamic energy of the RF by 28.3%. Besides, it improves performance by 6.1% while adding only 2KB of extra storage per core to the baseline RF of 256KB, which represents a negligible overhead of 0.78%.
△ Less
Submitted 26 October, 2023;
originally announced October 2023.
-
Vitamin-V: Virtual Environment and Tool-boxing for Trustworthy Development of RISC-V based Cloud Services
Authors:
A. Arelakis,
J. M. Arnau,
J. L. Berral,
A. Call,
R. Canal,
S. Di Carlo,
J. Costa,
D. Gizopoulos,
V. Karakostas,
F. Lubrano,
K. Nikas,
Y. Nikolakopoulos,
B. Otero,
G. Papadimitriou,
I. Papaefstathiou,
D. Pnevmatikatos,
D. Raho,
A. Rigo,
E. Rodríguez,
A. Savino,
A. Scionti,
N. Tampouratzis,
A. Torregrosa
Abstract:
Vitamin-V is a 2023-2025 Horizon Europe project that aims to develop a complete RISC-V open-source software stack for cloud services with comparable performance to the cloud-dominant x86 counterpart and a powerful virtual execution environment for software development, validation, verification, and test that considers the relevant RISC-V ISA extensions for cloud deployment.
Vitamin-V is a 2023-2025 Horizon Europe project that aims to develop a complete RISC-V open-source software stack for cloud services with comparable performance to the cloud-dominant x86 counterpart and a powerful virtual execution environment for software development, validation, verification, and test that considers the relevant RISC-V ISA extensions for cloud deployment.
△ Less
Submitted 27 June, 2024; v1 submitted 18 May, 2023;
originally announced May 2023.
-
K-D Bonsai: ISA-Extensions to Compress K-D Trees for Autonomous Driving Tasks
Authors:
Pedro H. E. Becker,
José María Arnau,
Antonio González
Abstract:
Autonomous Driving (AD) systems extensively manipulate 3D point clouds for object detection and vehicle localization. Thereby, efficient processing of 3D point clouds is crucial in these systems. In this work we propose K-D Bonsai, a technique to cut down memory usage during radius search, a critical building block of point cloud processing. K-D Bonsai exploits value similarity in the data structu…
▽ More
Autonomous Driving (AD) systems extensively manipulate 3D point clouds for object detection and vehicle localization. Thereby, efficient processing of 3D point clouds is crucial in these systems. In this work we propose K-D Bonsai, a technique to cut down memory usage during radius search, a critical building block of point cloud processing. K-D Bonsai exploits value similarity in the data structure that holds the point cloud (a k-d tree) to compress the data in memory. K-D Bonsai further compresses the data using a reduced floating-point representation, exploiting the physically limited range of point cloud values. For easy integration into nowadays systems, we implement K-D Bonsai through Bonsai-extensions, a small set of new CPU instructions to compress, decompress, and operate on points. To maintain baseline safety levels, we carefully craft the Bonsai-extensions to detect precision loss due to compression, allowing re-computation in full precision to take place if necessary. Therefore, K-D Bonsai reduces data movement, improving performance and energy efficiency, while guaranteeing baseline accuracy and programmability. We evaluate K-D Bonsai over the euclidean cluster task of Autoware.ai, a state-of-the-art software stack for AD. We achieve an average of 9.26% improvement in end-to-end latency, 12.19% in tail latency, and a reduction of 10.84% in energy consumption. Differently from expensive accelerators proposed in related work, K-D Bonsai improves radius search with minimal area increase (0.36%).
△ Less
Submitted 30 August, 2023; v1 submitted 1 February, 2023;
originally announced February 2023.
-
Exploiting Kernel Compression on BNNs
Authors:
Franyell Silfa,
Jose Maria Arnau,
Antonio González
Abstract:
Binary Neural Networks (BNNs) are showing tremendous success on realistic image classification tasks. Notably, their accuracy is similar to the state-of-the-art accuracy obtained by full-precision models tailored to edge devices. In this regard, BNNs are very amenable to edge devices since they employ 1-bit to store the inputs and weights, and thus, their storage requirements are low. Also, BNNs c…
▽ More
Binary Neural Networks (BNNs) are showing tremendous success on realistic image classification tasks. Notably, their accuracy is similar to the state-of-the-art accuracy obtained by full-precision models tailored to edge devices. In this regard, BNNs are very amenable to edge devices since they employ 1-bit to store the inputs and weights, and thus, their storage requirements are low. Also, BNNs computations are mainly done using xnor and pop-counts operations which are implemented very efficiently using simple hardware structures. Nonetheless, supporting BNNs efficiently on mobile CPUs is far from trivial since their benefits are hindered by frequent memory accesses to load weights and inputs.
In BNNs, a weight or an input is stored using one bit, and aiming to increase storage and computation efficiency, several of them are packed together as a sequence of bits. In this work, we observe that the number of unique sequences representing a set of weights is typically low. Also, we have seen that during the evaluation of a BNN layer, a small group of unique sequences is employed more frequently than others. Accordingly, we propose exploiting this observation by using Huffman Encoding to encode the bit sequences and then using an indirection table to decode them during the BNN evaluation. Also, we propose a clustering scheme to identify the most common sequences of bits and replace the less common ones with some similar common sequences. Hence, we decrease the storage requirements and memory accesses since common sequences are encoded with fewer bits.
We extend a mobile CPU by adding a small hardware structure that can efficiently cache and decode the compressed sequence of bits. We evaluate our scheme using the ReAacNet model with the Imagenet dataset. Our experimental results show that our technique can reduce memory requirement by 1.32x and improve performance by 1.35x.
△ Less
Submitted 1 December, 2022;
originally announced December 2022.
-
E-BATCH: Energy-Efficient and High-Throughput RNN Batching
Authors:
Franyell Silfa,
Jose Maria Arnau,
Antonio Gonzalez
Abstract:
Recurrent Neural Network (RNN) inference exhibits low hardware utilization due to the strict data dependencies across time-steps. Batching multiple requests can increase throughput. However, RNN batching requires a large amount of padding since the batched input sequences may largely differ in length. Schemes that dynamically update the batch every few time-steps avoid padding. However, they requi…
▽ More
Recurrent Neural Network (RNN) inference exhibits low hardware utilization due to the strict data dependencies across time-steps. Batching multiple requests can increase throughput. However, RNN batching requires a large amount of padding since the batched input sequences may largely differ in length. Schemes that dynamically update the batch every few time-steps avoid padding. However, they require executing different RNN layers in a short timespan, decreasing energy efficiency. Hence, we propose E-BATCH, a low-latency and energy-efficient batching scheme tailored to RNN accelerators. It consists of a runtime system and effective hardware support. The runtime concatenates multiple sequences to create large batches, resulting in substantial energy savings. Furthermore, the accelerator notifies it when the evaluation of a sequence is done, so that a new sequence can be immediately added to a batch, thus largely reducing the amount of padding. E-BATCH dynamically controls the number of time-steps evaluated per batch to achieve the best trade-off between latency and energy efficiency for the given hardware platform. We evaluate E-BATCH on top of E-PUR and TPU. In E-PUR, E-BATCH improves throughput by 1.8x and energy-efficiency by 3.6x, whereas in TPU, it improves throughput by 2.1x and energy-efficiency by 1.6x, over the state-of-the-art.
△ Less
Submitted 22 September, 2020;
originally announced September 2020.