
Siracusa: A 16 nm Heterogenous RISC-V SoC for Extended Reality with At-MRAM Neural Engine

Authors: Arpan Suravi Prasad, Moritz Scherer, Francesco Conti, Davide Rossi, Alfio Di Mauro, Manuel Eggimann, Jorge Tómas Gómez, Ziyun Li, Syed Shakib Sarwar, Zhao Wang, Barbara De Salvo, Luca Benini

Abstract: Extended reality (XR) applications are Machine Learning (ML)-intensive, featuring deep neural networks (DNNs) with millions of weights, tightly latency-bound (10-20 ms end-to-end), and power-constrained (low tens of mW average power). While ML performance and efficiency can be achieved by introducing neural engines within low-power systems-on-chip (SoCs), system-level power for nontrivial DNNs depends strongly on the energy of non-volatile memory (NVM) access for network weights. This work introduces Siracusa, a near-sensor heterogeneous SoC for next-generation XR devices manufactured in 16 nm CMOS. Siracusa couples an octa-core cluster of RISC-V digital signal processing cores with a novel tightly-coupled "At-Memory" integration between a state-of-the-art digital neural engine called N-EUREKA and an on-chip NVM based on magnetoresistive memory (MRAM), achieving 1.7x higher throughput and 3x better energy efficiency than XR SoCs using NVM as background memory. The fabricated SoC prototype achieves an area efficiency of 65.2 GOp/s/mm² and a peak energy efficiency of 8.84 TOp/J for DNN inference while supporting complex heterogeneous application workloads, which combine ML with conventional signal processing and control.

Submitted 14 April, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

Comments: Final accepted manuscript pre-print submitted to the IEEE Journal of Solid-State Circuits

arXiv:2305.18371

ColibriUAV: An Ultra-Fast, Energy-Efficient Neuromorphic Edge Processing UAV-Platform with Event-Based and Frame-Based Cameras

Authors: Sizhen Bian, Lukas Schulthess, Georg Rutishauser, Alfio Di Mauro, Luca Benini, Michele Magno

Abstract: Interest in dynamic vision sensor (DVS)-powered unmanned aerial vehicles (UAVs) is rising, especially due to the microsecond-level reaction time of the bio-inspired event sensor, which increases robustness and reduces the latency of perception tasks compared to an RGB camera. This work presents ColibriUAV, a UAV platform with both frame-based and event-based camera interfaces for efficient perception and near-sensor processing. The proposed platform is designed around Kraken, a novel low-power RISC-V System on Chip with two hardware accelerators targeting spiking neural networks and deep ternary neural networks. Kraken is capable of efficiently processing both event data from a DVS camera and frame data from an RGB camera. A key feature of Kraken is its integrated, dedicated interface with a DVS camera. This paper benchmarks the end-to-end latency and power efficiency of the neuromorphic and event-based UAV subsystem, demonstrating state-of-the-art event data with a throughput of 7200 frames of events per second and a power consumption of 10.7 mW, which is over 6.6 times faster and a hundred times less power-consuming than the widely-used data reading approach through the USB interface. The overall sensing and processing power consumption is below 50 mW, achieving latency in the milliseconds range, making the platform suitable for low-latency autonomous nano-drones as well.

Submitted 27 May, 2023; originally announced May 2023.

arXiv:2305.08415

doi 10.1109/JSSC.2023.3318301

Marsellus: A Heterogeneous RISC-V AI-IoT End-Node SoC with 2-to-8b DNN Acceleration and 30%-Boost Adaptive Body Biasing

Authors: Francesco Conti, Gianna Paulin, Angelo Garofalo, Davide Rossi, Alfio Di Mauro, Georg Rutishauser, Gianmarco Ottavi, Manuel Eggimann, Hayate Okuhara, Luca Benini

Abstract: Emerging Artificial Intelligence-enabled Internet-of-Things (AI-IoT) Systems-on-a-Chip (SoCs) for augmented reality, personalized healthcare, and nano-robotics need to run many diverse tasks within a power envelope of a few tens of mW over a wide range of operating conditions: compute-intensive but strongly quantized Deep Neural Network (DNN) inference, as well as signal processing and control requiring high-precision floating-point. We present Marsellus, an all-digital heterogeneous SoC for AI-IoT end-nodes fabricated in GlobalFoundries 22nm FDX that combines 1) a general-purpose cluster of 16 RISC-V Digital Signal Processing (DSP) cores attuned for the execution of a diverse range of workloads exploiting 4-bit and 2-bit arithmetic extensions (XpulpNN), combined with fused MAC&LOAD operations and floating-point support; 2) a 2-to-8-bit Reconfigurable Binary Engine (RBE) to accelerate 3x3 and 1x1 (pointwise) convolutions in DNNs; 3) a set of On-Chip Monitoring (OCM) blocks connected to an Adaptive Body Biasing (ABB) generator and a hardware control loop, enabling on-the-fly adaptation of transistor threshold voltages. Marsellus achieves up to 180 Gop/s or 3.32 Top/s/W on 2-bit precision arithmetic in software, and up to 637 Gop/s or 12.4 Top/s/W on hardware-accelerated DNN layers.

Submitted 28 November, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

Comments: Post-print accepted by IEEE Journal of Solid-State Circuits. Fixed metadata (was missing one co-author), added DOI of IEEE JSSC

arXiv:2302.07957

ColibriES: A Milliwatts RISC-V Based Embedded System Leveraging Neuromorphic and Neural Networks Hardware Accelerators for Low-Latency Closed-loop Control Applications

Authors: Georg Rutishauser, Robin Hunziker, Alfio Di Mauro, Sizhen Bian, Luca Benini, Michele Magno

Abstract: End-to-end event-based computation has the potential to push the envelope in latency and energy efficiency for edge AI applications. Unfortunately, event-based sensors (e.g., DVS cameras) and neuromorphic spike-based processors (e.g., Loihi) have been designed in a decoupled fashion, thereby missing major streamlining opportunities. This paper presents ColibriES, the first-ever neuromorphic hardware embedded system platform with dedicated event-sensor interfaces and full processing pipelines. ColibriES includes event and frame interfaces and data processing, aiming at efficient and long-life embedded systems in edge scenarios. ColibriES is based on the Kraken system-on-chip and contains a heterogeneous parallel ultra-low power (PULP) processor, frame-based and event-based camera interfaces, and two hardware accelerators for the computation of both event-based spiking neural networks and frame-based ternary convolutional neural networks. This paper explores and accurately evaluates the performance of event data processing on the example of gesture recognition on ColibriES, as the first step of full-system evaluation. In our experiments, we demonstrate a chip energy consumption of 7.7 mJ and a latency of 164.5 ms per inference with the DVS Gesture event data set as an example for closed-loop data processing, showcasing the potential of ColibriES for battery-powered applications such as wearable devices and UAVs that require low-latency closed-loop control.

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2212.00688

TCN-CUTIE: A 1036 TOp/s/W, 2.72 uJ/Inference, 12.2 mW All-Digital Ternary Accelerator in 22 nm FDX Technology

Authors: Moritz Scherer, Alfio Di Mauro, Tim Fischer, Georg Rutishauser, Luca Benini

Abstract: Tiny Machine Learning (TinyML) applications impose uJ/Inference constraints, with a maximum power consumption of tens of mW. It is extremely challenging to meet these requirements at a reasonable accuracy level. This work addresses the challenge with a flexible, fully digital Ternary Neural Network (TNN) accelerator in a RISC-V-based System-on-Chip (SoC). Besides supporting Ternary Convolutional Neural Networks, we introduce extensions to the accelerator design that enable the processing of time-dilated Temporal Convolutional Neural Networks (TCNs). The design achieves 5.5 uJ/Inference, 12.2 mW, 8000 Inferences/sec at 0.5 V for a Dynamic Vision Sensor (DVS)-based TCN with an accuracy of 94.5%, and 2.72 uJ/Inference, 12.2 mW, 3200 Inferences/sec at 0.5 V for a non-trivial 9-layer, 96 channels-per-layer convolutional network with a CIFAR-10 accuracy of 86%. The peak energy efficiency is 1036 TOp/s/W, outperforming the state-of-the-art silicon-proven TinyML quantized accelerators by 1.67x while achieving competitive accuracy.

Submitted 1 December, 2022; originally announced December 2022.

Comments: Accepted at IEEE MICRO Journal

arXiv:2210.06287

doi 10.1109/AICAS54282.2022.9869846

An Energy-Efficient Spiking Neural Network for Finger Velocity Decoding for Implantable Brain-Machine Interface

Authors: Jiawei Liao, Lars Widmer, Xiaying Wang, Alfio Di Mauro, Samuel R. Nason-Tomaszewski, Cynthia A. Chestek, Luca Benini, Taekwang Jang

Abstract: Brain-machine interfaces (BMIs) are promising for motor rehabilitation and mobility augmentation. High-accuracy and low-power algorithms are required to achieve implantable BMI systems. In this paper, we propose a novel spiking neural network (SNN) decoder for implantable BMI regression tasks. The SNN is trained with enhanced spatio-temporal backpropagation to fully leverage its ability to handle temporal problems. The proposed SNN decoder achieves the same level of correlation coefficient as the state-of-the-art ANN decoder in offline finger velocity decoding tasks, while requiring only 6.8% of the computation operations and 9.4% of the memory accesses.

Submitted 7 October, 2022; originally announced October 2022.

Journal ref: 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2022, pp. 134-137

arXiv:2209.01065

Kraken: A Direct Event/Frame-Based Multi-sensor Fusion SoC for Ultra-Efficient Visual Processing in Nano-UAVs

Authors: Alfio Di Mauro, Moritz Scherer, Davide Rossi, Luca Benini

Abstract: Small-size unmanned aerial vehicles (UAVs) have the potential to dramatically increase safety and reduce cost in applications like critical infrastructure maintenance and post-disaster search and rescue. Many scenarios require UAVs to shrink toward nano and pico-size form factors. The key open challenge to achieve true autonomy on Nano-UAVs is to run complex visual tasks like object detection, tracking, navigation and obstacle avoidance fully on board, at high speed and robustness, under tight payload and power constraints. With the Kraken SoC, fabricated in 22nm FDX technology, we demonstrate a multi-visual-sensor capability exploiting both event-based and BW/RGB imagers, combining their output for multi-functional visual tasks previously impossible on a single low-power chip for Nano-UAVs. Kraken is an ultra-low-power, heterogeneous SoC architecture integrating three acceleration engines and a vast set of peripherals to enable efficient interfacing with standard frame-based sensors and novel event-based DVS. Kraken enables highly sparse event-driven sub-uJ/inf SNN inference on a dedicated neuromorphic energy-proportional accelerator. Moreover, it can perform frame-based inference by combining a 1.8 TOp/s/W 8-core RISC-V processor cluster with mixed-precision DNN extensions with a 1036 TOp/s/W TNN accelerator.

Submitted 18 August, 2022; originally announced September 2022.

arXiv:2204.10687

SNE: an Energy-Proportional Digital Accelerator for Sparse Event-Based Convolutions

Authors: Alfio Di Mauro, Arpan Suravi Prasad, Zhikai Huang, Matteo Spallanzani, Francesco Conti, Luca Benini

Abstract: Event-based sensors are drawing increasing attention due to their high temporal resolution, low power consumption, and low bandwidth. To efficiently extract semantically meaningful information from sparse data streams produced by such sensors, we present a 4.5 TOP/s/W digital accelerator capable of performing 4-bit-quantized event-based convolutional neural networks (eCNN). Compared to standard convolutional engines, our accelerator performs a number of operations proportional to the number of events contained in the input data stream, ultimately achieving a high energy-to-information processing proportionality. On the IBM-DVS-Gesture dataset, we report 80 uJ/inf and 261 uJ/inf when the input activity is 1.2% and 4.9%, respectively. Our accelerator consumes 0.221 pJ/SOP; to the best of our knowledge, this is the lowest energy/OP reported on a digital neuromorphic engine.

Submitted 29 April, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

Comments: Accepted at DATE22

arXiv:2201.08656

doi 10.1109/TCSI.2023.3254810

Dustin: A 16-Cores Parallel Ultra-Low-Power Cluster with 2b-to-32b Fully Flexible Bit-Precision and Vector Lockstep Execution Mode

Authors: Gianmarco Ottavi, Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Alfio Di Mauro, Luca Benini, Davide Rossi

Abstract: Computationally intensive algorithms such as Deep Neural Networks (DNNs) are becoming killer applications for edge devices. Porting heavily data-parallel algorithms on resource-constrained and battery-powered devices poses several challenges related to memory footprint, computational throughput, and energy efficiency. Low-bitwidth and mixed-precision arithmetic have been proven to be valid strategies for tackling these problems. We present Dustin, a fully programmable compute cluster integrating 16 RISC-V cores capable of 2- to 32-bit arithmetic and all possible mixed-precision permutations. In addition to a conventional Multiple-Instruction Multiple-Data (MIMD) processing paradigm, Dustin introduces a Vector Lockstep Execution Mode (VLEM) to minimize power consumption in highly data-parallel kernels. In VLEM, a single leader core fetches instructions and broadcasts them to the 15 follower cores. Clock gating Instruction Fetch (IF) stages and private caches of the follower cores leads to 38% power reduction with minimal performance overhead (<3%). The cluster, implemented in 65 nm CMOS technology, achieves a peak performance of 58 GOPS and a peak efficiency of 1.15 TOPS/W.

Submitted 16 March, 2023; v1 submitted 21 January, 2022; originally announced January 2022.

Comments: 13 pages, 17 figures, 2 tables, Journal

arXiv:2110.09101

doi 10.1109/JSSC.2021.3114881

Vega: A 10-Core SoC for IoT End-Nodes with DNN Acceleration and Cognitive Wake-Up From MRAM-Based State-Retentive Sleep Mode

Authors: Davide Rossi, Francesco Conti, Manuel Eggimann, Alfio Di Mauro, Giuseppe Tagliavini, Stefan Mach, Marco Guermandi, Antonio Pullini, Igor Loi, Jie Chen, Eric Flamand, Luca Benini

Abstract: The Internet-of-Things requires end-nodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT end-node SoC capable of scaling from a 1.7 µW fully retentive cognitive sleep mode up to 32.2 GOPS (@ 49.4 mW) peak performance on NSAAs, including mobile DNN inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile MRAM. To meet the performance and flexibility requirements of NSAAs, the SoC features 10 RISC-V cores: one core for SoC and IO management and a 9-core cluster supporting multi-precision SIMD integer and floating-point computation. Vega achieves SoA-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On floating-point (FP) computation, it achieves SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine-learning (ML) accelerators boost energy efficiency in cognitive sleep and active states, respectively.

Submitted 18 October, 2021; originally announced October 2021.

Comments: 13 pages, 11 figures, 8 tables, journal paper

arXiv:2010.04566

An Energy-Efficient Low-Voltage Swing Transceiver for mW-Range IoT End-Nodes

Authors: Hayate Okuhara, Ahmed Elnaqib, Davide Rossi, Alfio Di Mauro, Philipp Mayer, Pierpaolo Palestri, Luca Benini

Abstract: As Internet-of-Things (IoT) applications become more and more pervasive, IoT end nodes require more and more computational power within a power envelope of a few mW, coupled with high-speed and energy-efficient inter-chip communication to deal with the growing input/output and memory bandwidth of emerging near-sensor analytics applications. While traditional interfaces such as SPI cannot cope with these tight requirements, low-voltage swing transceivers can tackle this challenge thanks to their capability to achieve several Gbps of bandwidth at extremely low power. However, recent research on high-speed serial links has addressed this challenge only partially, proposing only partial or stand-alone designs, and not addressing their integration in real systems and the related implications. In this paper, we present for the first time a complete design and system-level architecture of a low-voltage swing transceiver integrated within a low-power (mW range) IoT end-node processor, and we compare it with existing microcontroller interfaces. The transceiver, implemented in a commercial 65-nm CMOS technology, achieves 10.2x higher energy efficiency at 15.7x higher performance than traditional microcontroller peripherals (single lane).

Submitted 9 October, 2020; originally announced October 2020.

Comments: ISCAS2020

arXiv:2007.13667

doi 10.1016/j.vlsi.2019.12.006

Performance-Aware Predictive-Model-Based On-Chip Body-Bias Regulation Strategy for an ULP Multi-Core Cluster in 28nm UTBB FD-SOI

Authors: Alfio Di Mauro, Davide Rossi, Antonio Pullini, Philippe Flatresse, Luca Benini

Abstract: The performance and reliability of Ultra-Low-Power (ULP) computing platforms are adversely affected by environmental temperature and process variations. Mitigating the effect of these phenomena becomes crucial when these devices operate near-threshold, due to the magnification of process variations and to the strong temperature inversion effect that affects advanced technology nodes in low-voltage corners, which causes huge overhead due to margining for timing closure. Supporting an extended range of reverse and forward body-bias, UTBB FD-SOI technology provides a powerful knob to compensate for such variations. In this work we propose a methodology to maximize energy efficiency at run-time exploiting body biasing on a ULP platform operating near-threshold. The proposed method relies on on-line performance measurements by means of Process Monitoring Blocks (PMBs) coupled with an on-chip low-power body bias generator. We correlate the measurements performed by the PMBs to the maximum achievable frequency of the system, deriving a predictive model able to estimate it with an error of 9.7% at 0.7V. To minimize the effect of process variations we propose a calibration procedure that allows the use of a PMB model affected only by the temperature-induced error, which reduces the frequency estimation error by 2.4x (from 9.7% to 4%). We finally propose a controller architecture relying on the derived models to automatically regulate the body bias voltage at run-time. We demonstrate that adjusting the body bias voltage against environmental temperature variations leads to up to a 2X reduction in leakage power and a 15% improvement in global energy consumption when the system operates at 0.7V and 170MHz.

Submitted 27 July, 2020; originally announced July 2020.

Journal ref: Integration, Volume 72, 2020, Pages 194-207

arXiv:2007.08952

Always-On 674uW @ 4GOP/s Error Resilient Binary Neural Networks with Aggressive SRAM Voltage Scaling on a 22nm IoT End-Node

Authors: Alfio Di Mauro, Francesco Conti, Pasquale Davide Schiavone, Davide Rossi, Luca Benini

Abstract: Binary Neural Networks (BNNs) have been shown to be robust to random bit-level noise, making aggressive voltage scaling attractive as a power-saving technique for both logic and SRAMs. In this work, we introduce the first fully programmable IoT end-node system-on-chip (SoC) capable of executing software-defined, hardware-accelerated BNNs at ultra-low voltage. Our SoC exploits a hybrid memory scheme where error-vulnerable SRAMs are complemented by reliable standard-cell memories to safely store critical data under aggressive voltage scaling. On a prototype in 22nm FDX technology, we demonstrate that both the logic and SRAM voltage can be dropped to 0.5 V without any accuracy penalty on a BNN trained for the CIFAR-10 dataset, improving energy efficiency by 2.2X w.r.t. nominal conditions. Furthermore, we show that the supply voltage can be dropped to 0.42 V (50% of nominal) while keeping more than 99% of the nominal accuracy (with a bit error rate ~1/1000). In this operating point, our prototype performs 4 Gop/s (15.4 Inference/s on the CIFAR-10 dataset) by computing up to 13 binary ops per pJ, achieving 22.8 Inference/s/mW while keeping within a peak power envelope of 674 uW - low enough to enable always-on operation in ultra-low power smart cameras, long-lifetime environmental sensors, and insect-sized pico-drones.

Submitted 17 July, 2020; originally announced July 2020.

Comments: Submitted to ISICAS2020 journal special issue

arXiv:2006.14256

Arnold: an eFPGA-Augmented RISC-V SoC for Flexible and Low-Power IoT End-Nodes

Authors: Pasquale Davide Schiavone, Davide Rossi, Alfio Di Mauro, Frank Gurkaynak, Timothy Saxe, Mao Wang, Ket Chong Yap, Luca Benini

Abstract: A wide range of Internet of Things (IoT) applications require powerful, energy-efficient and flexible end-nodes to acquire data from multiple sources, process and distill the sensed data through near-sensor data analytics algorithms, and transmit it wirelessly. This work presents Arnold: a 0.5 V to 0.8 V, 46.83 uW/MHz, 600 MOPS fully programmable RISC-V Microcontroller unit (MCU) fabricated in 22 nm GlobalFoundries GF22FDX (GF22FDX) technology, coupled with a state-of-the-art (SoA) microcontroller to an embedded Field Programmable Gate Array (FPGA). We demonstrate the flexibility of the System-On-Chip (SoC) to tackle the challenges of many emerging IoT applications, such as (i) interfacing sensors and accelerators with non-standard interfaces, (ii) performing on-the-fly pre-processing tasks on data streamed from peripherals, and (iii) accelerating near-sensor analytics, encryption, and machine learning tasks. A unique feature of the proposed SoC is the exploitation of body-biasing to reduce leakage power of the embedded FPGA (eFPGA) fabric by up to 18x at 0.5 V, achieving SoA state bitstream-retentive sleep power for the eFPGA fabric, as low as 20.5 uW. The proposed SoC provides 3.4x better performance and 2.9x better energy efficiency than other fabricated heterogeneous re-configurable SoCs of the same class.

Submitted 25 June, 2020; originally announced June 2020.

Showing 1–14 of 14 results for author: Di Mauro, A