-
M18K: A Comprehensive RGB-D Dataset and Benchmark for Mushroom Detection and Instance Segmentation
Authors:
Abdollah Zakeri,
Mulham Fawakherji,
Jiming Kang,
Bikram Koirala,
Venkatesh Balan,
Weihang Zhu,
Driss Benhaddou,
Fatima A. Merchant
Abstract:
Automating agricultural processes holds significant promise for enhancing efficiency and sustainability in various farming practices. This paper contributes to the automation of agricultural processes by providing a dedicated mushroom detection dataset related to automated harvesting, growth monitoring, and quality control of the button mushroom produced using the Agaricus bisporus fungus. With over 18,000 mushroom instances in 423 RGB-D image pairs taken with an Intel RealSense D405 camera, it fills the gap in mushroom-specific datasets and serves as a benchmark for detection and instance segmentation algorithms in smart mushroom agriculture. The dataset, featuring realistic growth environment scenarios with comprehensive annotations, is assessed using advanced detection and instance segmentation algorithms. The paper details the dataset's characteristics and evaluates algorithmic performance. For broader applicability, all resources, including images, code, and trained models, are publicly available via our GitHub repository: https://github.com/abdollahzakeri/m18k
Submitted 15 July, 2024;
originally announced July 2024.
-
Resistive Memory for Computing and Security: Algorithms, Architectures, and Platforms
Authors:
Simranjeet Singh,
Farhad Merchant,
Sachin Patkar
Abstract:
Resistive random-access memory (RRAM) is gaining popularity due to its ability to offer computing within the memory and its non-volatile nature. The unique properties of RRAM, such as binary switching, multi-state switching, and device variations, can be leveraged to design novel techniques and algorithms. This thesis proposes a technique for utilizing RRAM devices in three major directions: i) digital logic implementation, ii) multi-valued computing, and iii) hardware security primitive design. We propose new algorithms and architectures and conduct experimental studies on each implementation. Moreover, we develop the electronic design automation framework and hardware platforms to facilitate these experiments.
Submitted 4 July, 2024;
originally announced July 2024.
-
In-Memory Mirroring: Cloning Without Reading
Authors:
Simranjeet Singh,
Ankit Bende,
Chandan Kumar Jha,
Vikas Rana,
Rolf Drechsler,
Sachin Patkar,
Farhad Merchant
Abstract:
In-memory computing (IMC) has gained significant attention recently as it attempts to reduce the impact of memory bottlenecks. Numerous schemes for digital IMC are presented in the literature, focusing on logic operations. Often, an application's description has data dependencies that must be resolved. Contemporary IMC architectures perform read followed by write operations for this purpose, which results in performance and energy penalties. To solve this fundamental problem, this paper presents in-memory mirroring (IMM). IMM eliminates the need for read and write-back steps, thus avoiding energy and performance penalties. Instead, we perform data movement within memory, involving row-wise and column-wise data transfers. Additionally, the IMM scheme enables parallel cloning of an entire row (word) with a complexity of $\mathcal{O}(1)$. Moreover, we analyze the energy consumption of the proposed technique using a resistive random-access memory crossbar and the experimentally validated JART VCM v1b model. IMM increases energy efficiency and shows a 2$\times$ performance improvement compared to conventional data movement methods.
Submitted 4 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
Error Detection and Correction Codes for Safe In-Memory Computations
Authors:
Luca Parrini,
Taha Soliman,
Benjamin Hettwer,
Jan Micha Borrmann,
Simranjeet Singh,
Ankit Bende,
Vikas Rana,
Farhad Merchant,
Norbert Wehn
Abstract:
In-Memory Computing (IMC) introduces a new paradigm of computation that offers high efficiency in terms of latency and power consumption for AI accelerators. However, the non-idealities and defects of emerging technologies used in advanced IMC can severely degrade the accuracy of inferred Neural Networks (NN) and lead to malfunctions in safety-critical applications. In this paper, we investigate an architectural-level mitigation technique based on the coordinated action of multiple checksum codes, to detect and correct errors at run-time. This implementation demonstrates higher efficiency in recovering accuracy across different AI algorithms and technologies compared to more traditional methods such as Triple Modular Redundancy (TMR). The results show that several configurations of our implementation recover more than 91% of the original accuracy with less than half of the area required by TMR and less than 40% of latency overhead.
Submitted 15 April, 2024;
originally announced April 2024.
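The checksum idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' scheme: it is a generic, textbook-style pair of checksums (one plain, one index-weighted) guarding an integer matrix-vector product, and it assumes at most one corrupted output entry. All function names are hypothetical.

```python
def matvec(W, x):
    """Reference matrix-vector product y = W @ x on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_checksums(W):
    """Plain and index-weighted column sums of W, computed once offline."""
    cols = list(zip(*W))
    plain = [sum(col) for col in cols]
    weighted = [sum((i + 1) * v for i, v in enumerate(col)) for col in cols]
    return plain, weighted


def check_and_correct(y, x, plain, weighted):
    """Detect and repair a single erroneous entry in y (assumed y = W @ x).

    Returns (corrected_y, index_of_fixed_entry or None).
    """
    s1 = sum(p * xi for p, xi in zip(plain, x))     # expected sum of y
    s2 = sum(w * xi for w, xi in zip(weighted, x))  # expected weighted sum of y
    d1 = sum(y) - s1                                # error magnitude
    d2 = sum((i + 1) * v for i, v in enumerate(y)) - s2  # (index + 1) * error
    if d1 == 0 and d2 == 0:
        return list(y), None                        # checksums agree
    if d1 == 0 or d2 % d1 != 0 or not (1 <= d2 // d1 <= len(y)):
        raise ValueError("error detected but not correctable by this scheme")
    k = d2 // d1 - 1                                # position of the bad entry
    fixed = list(y)
    fixed[k] -= d1
    return fixed, k


# Toy data, made up for illustration only.
W = [[1, 2], [3, 4], [5, 6]]
x = [1, 1]
plain, weighted = make_checksums(W)
corrupted = [3, 12, 11]                             # true result is [3, 7, 11]
print(check_and_correct(corrupted, x, plain, weighted))
```

Two checksums are the minimum needed to both size and locate a single error; real accelerators would also have to tolerate analog noise, which this integer-only sketch ignores.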
-
QTFlow: Quantitative Timing-Sensitive Information Flow for Security-Aware Hardware Design on RTL
Authors:
Lennart M. Reimann,
Anshul Prashar,
Chiara Ghinami,
Rebecca Pelke,
Dominik Sisejkovic,
Farhad Merchant,
Rainer Leupers
Abstract:
In contemporary Electronic Design Automation (EDA) tools, security often takes a backseat to the primary goals of power, performance, and area optimization. Commonly, the security analysis is conducted by hand, leading to vulnerabilities in the design remaining unnoticed. Security-aware EDA tools assist the designer in the identification and removal of security threats while keeping performance and area in mind. Cutting-edge methods employ information flow analysis to identify inadvertent information leaks in design structures. Current information leakage detection methods use quantitative information flow analysis to quantify the leaks. However, handling sequential circuits poses challenges for state-of-the-art techniques due to their time-agnostic nature, overlooking timing channels, and introducing false positives. To address this, we introduce QTFlow, a timing-sensitive framework for quantifying hardware information leakages during the design phase. Illustrating its effectiveness on open-source benchmarks, QTFlow autonomously identifies timing channels and diminishes all false positives arising from time-agnostic analysis when contrasted with current state-of-the-art techniques.
Submitted 6 February, 2024; v1 submitted 31 January, 2024;
originally announced January 2024.
-
Experimental Validation of Memristor-Aided Logic Using 1T1R TaOx RRAM Crossbar Array
Authors:
Ankit Bende,
Simranjeet Singh,
Chandan Kumar Jha,
Tim Kempen,
Felix Cüppers,
Christopher Bengel,
Andre Zambanini,
Dennis Nielinger,
Sachin Patkar,
Rolf Drechsler,
Rainer Waser,
Farhad Merchant,
Vikas Rana
Abstract:
The memristor-aided logic (MAGIC) design style holds high promise for realizing digital logic-in-memory functionality. The ability to implement a specific gate in the MAGIC design style hinges on the SET-to-RESET threshold ratio. The TaOx memristive devices exhibit distinct SET-to-RESET ratios, enabling the implementation of OR and NOT operations. As the adoption of the MAGIC design style gains momentum, it becomes crucial to understand the breakdown of energy consumption in the various phases of its operation. This paper presents experimental demonstrations of the OR and NOT gates on a 1T1R crossbar array. Additionally, it provides insights into the energy distribution for performing these operations at different stages. Through our experiments across different gates, we found that the energy consumption is dominated by initialization in the MAGIC design style. The energy split-up is 14.8%, 85%, and 0.2% for execution, initialization, and read operations, respectively.
Submitted 16 October, 2023;
originally announced October 2023.
-
MemSPICE: Automated Simulation and Energy Estimation Framework for MAGIC-Based Logic-in-Memory
Authors:
Simranjeet Singh,
Chandan Kumar Jha,
Ankit Bende,
Vikas Rana,
Sachin Patkar,
Rolf Drechsler,
Farhad Merchant
Abstract:
Existing logic-in-memory (LiM) research is limited to generating mappings and micro-operations. In this paper, we present MemSPICE, a novel framework that addresses this gap by automatically generating both the netlist and testbench needed to evaluate the LiM on a memristive crossbar. MemSPICE goes beyond conventional approaches by providing energy estimation scripts to calculate the precise energy consumption of the testbench at the SPICE level. We propose an automated framework that utilizes the mapping obtained from the SIMPLER tool to perform accurate energy estimation through SPICE simulations. To the best of our knowledge, no existing framework is capable of generating a SPICE netlist from a hardware description language. By offering a comprehensive solution for SPICE-based netlist generation, testbench creation, and accurate energy estimation, MemSPICE empowers researchers and engineers working on memristor-based LiM to enhance their understanding and optimization of energy usage in these systems. Finally, we tested the circuits from the ISCAS'85 benchmark on MemSPICE and conducted a detailed energy analysis.
Submitted 9 September, 2023;
originally announced September 2023.
-
SoftFlow: Automated HW-SW Confidentiality Verification for Embedded Processors
Authors:
Lennart M. Reimann,
Jonathan Wiesner,
Dominik Sisejkovic,
Farhad Merchant,
Rainer Leupers
Abstract:
Despite its ever-increasing impact, security is not considered as a design objective in commercial electronic design automation (EDA) tools. This results in vulnerabilities being overlooked during the software-hardware design process. Specifically, vulnerabilities that allow leakage of sensitive data might stay unnoticed by standard testing, as the leakage itself might not result in evident functional changes. Therefore, EDA tools are needed to elaborate the confidentiality of sensitive data during the design process. However, state-of-the-art implementations either solely consider the hardware or restrict the expressiveness of the security properties that must be proven. Consequently, more proficient tools are required to assist in the software and hardware design. To address this issue, we propose SoftFlow, an EDA tool that allows determining whether a given software exploits existing leakage paths in hardware. Based on our analysis, the leakage paths can be retained if proven not to be exploited by software. This is desirable if the removal significantly impacts the design's performance or functionality, or if the path cannot be removed as the chip is already manufactured. We demonstrate the feasibility of SoftFlow by identifying vulnerabilities in OpenSSL cryptographic C programs, and redesigning them to avoid leakage of cryptographic keys in a RISC-V architecture.
Submitted 4 August, 2023;
originally announced August 2023.
-
Should We Even Optimize for Execution Energy? Rethinking Mapping for MAGIC Design Style
Authors:
Simranjeet Singh,
Chandan Kumar Jha,
Ankit Bende,
Phrangboklang Lyngton Thangkhiew,
Vikas Rana,
Sachin Patkar,
Rolf Drechsler,
Farhad Merchant
Abstract:
Memristor-based logic-in-memory (LiM) has become popular as a means to overcome the von Neumann bottleneck in traditional data-intensive computing. Recently, the memristor-aided logic (MAGIC) design style has gained immense traction for LiM due to its simplicity. However, understanding the energy distribution during the design of logic operations within the memristive memory is crucial in assessing such an implementation's significance. The current energy estimation methods rely on coarse-grained techniques, which underestimate the energy consumption of MAGIC-styled operations performed on a memristor crossbar. To address this issue, we analyze the energy breakdown in MAGIC operations and propose a solution that utilizes mapping from the SIMPLER MAGIC tool to achieve accurate energy estimation through SPICE simulations. In contrast to existing research that primarily focuses on optimizing execution energy, our findings reveal that the memristor's initialization energy in the MAGIC design style is, on average, 68x higher. We demonstrate that this initialization energy significantly dominates the overall energy consumption. By highlighting this aspect, we aim to redirect the attention of designers towards developing algorithms and strategies that prioritize optimizations in initializations rather than execution for more effective energy savings.
Submitted 7 July, 2023;
originally announced July 2023.
-
IMBUE: In-Memory Boolean-to-CUrrent Inference ArchitecturE for Tsetlin Machines
Authors:
Omar Ghazal,
Simranjeet Singh,
Tousif Rahman,
Shengqi Yu,
Yujin Zheng,
Domenico Balsamo,
Sachin Patkar,
Farhad Merchant,
Fei Xia,
Alex Yakovlev,
Rishad Shafik
Abstract:
In-memory computing for Machine Learning (ML) applications remedies the von Neumann bottlenecks by organizing computation to exploit parallelism and locality. Non-volatile memory devices such as Resistive RAM (ReRAM) offer integrated switching and storage capabilities showing promising performance for ML applications. However, ReRAM devices have design challenges, such as non-linear digital-analog conversion and circuit overheads. This paper proposes an In-Memory Boolean-to-Current Inference Architecture (IMBUE) that uses ReRAM-transistor cells to eliminate the need for such conversions. IMBUE processes Boolean feature inputs expressed as digital voltages and generates parallel current paths based on resistive memory states. The proportional column current is then translated back to the Boolean domain for further digital processing. The IMBUE architecture is inspired by the Tsetlin Machine (TM), an emerging ML algorithm based on intrinsically Boolean logic. The IMBUE architecture demonstrates significant performance improvements over binarized convolutional neural networks and digital TM in-memory implementations, achieving up to a 12.99x and 5.28x increase, respectively.
Submitted 22 May, 2023;
originally announced May 2023.
-
Finite State Automata Design using 1T1R ReRAM Crossbar
Authors:
Simranjeet Singh,
Omar Ghazal,
Chandan Kumar Jha,
Vikas Rana,
Rolf Drechsler,
Rishad Shafik,
Alex Yakovlev,
Sachin Patkar,
Farhad Merchant
Abstract:
Data movement costs constitute a significant bottleneck in modern machine learning (ML) systems. When combined with the computational complexity of algorithms, such as neural networks, designing hardware accelerators with a low energy footprint remains challenging. Finite state automata (FSA) constitute a type of computation model used as a low-complexity learning unit in ML systems. The implementation of an FSA consists of a number of memory states. However, an FSA can be in only one of these states at a given time. It switches to another state based on the present state and the input to the FSA. Due to its natural synergy with memory, it is a promising candidate for in-memory computing for reduced data movement costs. This work focuses on a novel FSA implementation using resistive RAM (ReRAM) for state storage in series with a CMOS transistor for biasing controls. We propose using multi-level ReRAM technology capable of transitioning between states depending on bias pulse amplitude and duration. We use an asynchronous control circuit for writing each ReRAM-transistor cell for the on-demand switching of the FSA. We investigate the impact of the device-to-device and cycle-to-cycle variations on the cell and show that FSA transitions can be seamlessly achieved without degradation of performance. Through extensive experimental evaluation, we demonstrate the implementation of FSA on a 1T1R ReRAM crossbar.
Submitted 30 June, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Integrated Architecture for Neural Networks and Security Primitives using RRAM Crossbar
Authors:
Simranjeet Singh,
Furqan Zahoor,
Gokulnath Rajendran,
Vikas Rana,
Sachin Patkar,
Anupam Chattopadhyay,
Farhad Merchant
Abstract:
This paper proposes an architecture that integrates neural networks (NNs) and hardware security modules using a single resistive random access memory (RRAM) crossbar. The proposed architecture enables using a single crossbar to implement NN, true random number generator (TRNG), and physical unclonable function (PUF) applications while exploiting the multi-state storage characteristic of the RRAM crossbar for the vector-matrix multiplication operation required for the implementation of NN. The TRNG is implemented by utilizing the crossbar's variation in device switching thresholds to generate random bits. The PUF is implemented using the same crossbar initialized as an entropy source for the TRNG. Additionally, the weights locking concept is introduced to enhance the security of NNs by preventing unauthorized access to the NN weights. The proposed architecture provides flexibility to configure the RRAM device in multiple modes to suit different applications. It shows promise in achieving a more efficient and compact design for the hardware implementation of NNs and security primitives.
Submitted 1 May, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Gate Camouflaging Using Reconfigurable ISFET-Based Threshold Voltage Defined Logic
Authors:
Elmira Moussavi,
Animesh Singh,
Dominik Sisejkovic,
Aravind Padma Kumar,
Daniyar Kizatov,
Sven Ingebrandt,
Rainer Leupers,
Vivek Pachauri,
Farhad Merchant
Abstract:
Most chip designers outsource the manufacturing of their integrated circuits (ICs) to external foundries due to the exorbitant cost and complexity of the process. This involvement of untrustworthy, external entities opens the door to major security threats, such as reverse engineering (RE). RE can reveal the physical structure and functionality of intellectual property (IP) and ICs, leading to IP theft, counterfeiting, and other misuses. The concept of the threshold voltage-defined (TVD) logic family is a potential mechanism to obfuscate and protect the design and prevent RE. However, it addresses post-fabrication RE issues, and it has been shown that dopant profiling techniques can be used to determine the threshold voltage of the transistor and break the obfuscation. In this work, we propose a novel TVD modulation with ion-sensitive field-effect transistors (ISFETs) to protect the IC from RE and IP piracy. Compared to the conventional TVD logic family, ISFET-TVD allows post-manufacture programming. The ISFET-TVD logic gate can be reconfigured after fabrication, maintaining an exact schematic architecture with an identical layout for all types of logic gates, and thus overcoming the shortcomings of the classic TVD. The threshold voltage of the ISFETs can be adjusted after fabrication by changing the ion concentration of the material in contact with the ion-sensitive gate of the transistor, depending on the Boolean functionality. The ISFET is CMOS compatible, and therefore implemented on 45 nm CMOS technology for demonstration.
Submitted 12 April, 2023;
originally announced April 2023.
-
Hardware Security Primitives using Passive RRAM Crossbar Array: Novel TRNG and PUF Designs
Authors:
Simranjeet Singh,
Furqan Zahoor,
Gokulnath Rajendran,
Sachin Patkar,
Anupam Chattopadhyay,
Farhad Merchant
Abstract:
With rapid advancements in electronic gadgets, the security and privacy aspects of these devices are significant. For the design of secure systems, physical unclonable function (PUF) and true random number generator (TRNG) are critical hardware security primitives for security applications. This paper proposes novel implementations of PUF and TRNGs on the RRAM crossbar structure. Firstly, two techniques to implement the TRNG in the RRAM crossbar are presented based on write-back and 50% switching probability pulse. The randomness of the proposed TRNGs is evaluated using the NIST test suite. Next, an architecture to implement the PUF in the RRAM crossbar is presented. The initial entropy source for the PUF is used from TRNGs, and challenge-response pairs (CRPs) are collected. The proposed PUF exploits the device variations and sneak-path current to produce unique CRPs. We demonstrate, through extensive experiments, reliability of 100%, uniqueness of 47.78%, uniformity of 49.79%, and bit-aliasing of 48.57% without any post-processing techniques. Finally, the design is compared with the literature to evaluate its implementation efficiency, which is clearly found to be superior to the state-of-the-art.
Submitted 7 November, 2022;
originally announced November 2022.
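The PUF quality figures quoted in the abstract above (reliability, uniqueness, uniformity, bit-aliasing) are commonly computed from Hamming-weight and Hamming-distance statistics. The sketch below uses the usual textbook definitions, which are assumed here and may differ in detail from those used in the paper; the response bits are made-up toy data and all function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of bit positions in which two equal-length responses differ."""
    return sum(x != y for x, y in zip(a, b))


def uniformity(response):
    """Percentage of 1-bits in one device's response (ideal: 50%)."""
    return 100.0 * sum(response) / len(response)


def uniqueness(responses):
    """Mean pairwise inter-device Hamming distance in percent (ideal: 50%)."""
    pairs = list(combinations(responses, 2))
    n_bits = len(responses[0])
    return 100.0 * sum(hamming(a, b) for a, b in pairs) / (len(pairs) * n_bits)


def bit_aliasing(responses):
    """Per-bit-position percentage of 1s across devices (ideal: 50% each)."""
    n_dev = len(responses)
    return [100.0 * sum(col) / n_dev for col in zip(*responses)]


def reliability(reference, remeasured):
    """100% minus the mean intra-device Hamming distance to the reference."""
    n_bits = len(reference)
    mean_hd = sum(hamming(reference, r) for r in remeasured) / len(remeasured)
    return 100.0 * (1 - mean_hd / n_bits)


# Toy responses from four hypothetical devices to the same challenge.
responses = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
print(uniformity(responses[0]), uniqueness(responses), bit_aliasing(responses))
```

With only four 4-bit responses the numbers are not statistically meaningful; the point is only to show how each metric is derived from the raw response bits.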
-
A Temperature Independent Readout Circuit for ISFET-Based Sensor Applications
Authors:
Elmira Moussavi,
Dominik Sisejkovic,
Animesh Singh,
Daniyar Kizatov,
Rainer Leupers,
Sven Ingebrandt,
Vivek Pachauri,
Farhad Merchant
Abstract:
The ion-sensitive field-effect transistor (ISFET) is an emerging technology that has received much attention in numerous research areas, including biochemistry, medicine, and security applications. However, compared to other types of sensors, the complexity of ISFETs makes it more challenging to achieve a sensitive, fast, and repeatable response. Therefore, various readout circuits have been developed to improve the performance of ISFETs, especially to eliminate the temperature effect. This paper presents a new approach for a temperature-independent readout circuit that uses the threshold voltage differences of an ISFET-MOSFET pair. The Linear Technology Simulation Program with Integrated Circuit Emphasis (LTspice) is used to analyze the ISFET performance based on the proposed readout circuit characteristics. A macro-model is used to model ISFET behavior, including the first-level SPICE model for the MOSFET part and Verilog-A to model the surface potential, reference electrode, and electrolyte of the ISFET to determine the relationships between variables. In this way, the behavior of the ISFET is monitored by the output voltage of the readout circuit based on a change in the electrolyte's hydrogen potential (pH), determined by the simulation. The proposed readout circuit has a temperature coefficient of 11.9 ppm/°C for a temperature range of 0-100 °C and pH between 1 and 13. The proposed ISFET readout circuit outperforms other designs in terms of simplicity and not requiring an additional sensor.
Submitted 9 August, 2022;
originally announced August 2022.
-
PA-PUF: A Novel Priority Arbiter PUF
Authors:
Simranjeet Singh,
Srinivasu Bodapati,
Sachin Patkar,
Rainer Leupers,
Anupam Chattopadhyay,
Farhad Merchant
Abstract:
This paper proposes a 3-input arbiter-based novel physically unclonable function (PUF) design. Firstly, a 3-input priority arbiter is designed using a simple arbiter, two multiplexers (2:1), and an XOR logic gate. The priority arbiter has an equal probability of 0's and 1's at the output, which results in excellent uniformity (49.45%) while retrieving the PUF response. Secondly, a new PUF design based on priority arbiter PUF (PA-PUF) is presented. The PA-PUF design is evaluated for uniqueness, non-linearity, and uniformity against the standard tests. The proposed PA-PUF design is configurable in challenge-response pairs through an arbitrary number of feed-forward priority arbiters introduced to the design. We demonstrate, through extensive experiments, reliability of 100% after performing the error correction techniques and uniqueness of 49.63%. Finally, the design is compared with the literature to evaluate its implementation efficiency, where it is clearly found to be superior compared to the state-of-the-art.
Submitted 21 July, 2022;
originally announced July 2022.
-
pHGen: A pH-Based Key Generation Mechanism Using ISFETs
Authors:
Elmira Moussavi,
Dominik Sisejkovic,
Fabian Brings,
Daniyar Kizatov,
Animesh Singh,
Xuan Thang Vu,
Sven Ingebrandt,
Rainer Leupers,
Vivek Pachauri,
Farhad Merchant
Abstract:
Digital keys are a fundamental component of many hardware- and software-based security mechanisms. However, digital keys are limited to binary values and easily exploitable when stored in standard memories. In this paper, based on emerging technologies, we introduce pHGen, a potential-of-hydrogen (pH)-based key generation mechanism that leverages chemical reactions in the form of a potential change in ion-sensitive field-effect transistors (ISFETs). The threshold voltage of ISFETs is manipulated corresponding to a known pH buffer solution (key) in which the transistors are immersed. To read the chemical information effectively via ISFETs, we designed a readout circuit for stable operation and detection of voltage thresholds. To demonstrate the applicability of the proposed key generation, we utilize pHGen for logic locking -- a hardware integrity protection scheme. The proposed key-generation method breaks the limits of binary values and provides the first steps toward the utilization of multi-valued voltage thresholds of ISFETs controlled by chemical information. The pHGen approach is expected to be a turning point for using more sophisticated bio-based analog keys for securing next-generation electronics.
Submitted 24 February, 2022;
originally announced February 2022.
-
A Parallel SystemC Virtual Platform for Neuromorphic Architectures
Authors:
Melvin Galicia,
Farhad Merchant,
Rainer Leupers
Abstract:
With the increasing interest in neuromorphic computing, designers of embedded systems face the challenge of efficiently simulating such platforms to enable architecture design exploration early in the development cycle. Executing artificial neural network applications on neuromorphic systems which are being simulated on virtual platforms (VPs) is an extremely demanding computational task. Nevertheless, it is a vital benchmarking task for comparing different possible architectures. Therefore, exploiting the multicore capabilities of the VP's host system is essential to achieve faster simulations. Hence, this paper presents a parallel SystemC-based VP for RISC-V multicore platforms integrating multiple computing-in-memory neuromorphic accelerators. Different VP segmentation architectures are explored for the integration of neuromorphic accelerators, and their simulation speedups over conventional sequential SystemC execution are reported.
Submitted 24 December, 2021;
originally announced December 2021.
-
NeuroHammer: Inducing Bit-Flips in Memristive Crossbar Memories
Authors:
Felix Staudigl,
Hazem Al Indari,
Daniel Schön,
Dominik Sisejkovic,
Farhad Merchant,
Jan Moritz Joseph,
Vikas Rana,
Stephan Menzel,
Rainer Leupers
Abstract:
Emerging non-volatile memory (NVM) technologies offer unique advantages in energy efficiency, latency, and features such as computing-in-memory. Consequently, emerging NVM technologies are considered an ideal substrate for computation and storage in future-generation neuromorphic platforms. These technologies need to be evaluated for fundamental reliability and security issues. In this paper, we present NeuroHammer, a security threat in ReRAM crossbars caused by thermal crosstalk between memory cells. We demonstrate that bit-flips can be deliberately induced in ReRAM devices in a crossbar by systematically writing adjacent memory cells. A simulation flow is developed to evaluate NeuroHammer and the impact of physical parameters on the effectiveness of the attack. Finally, we discuss the security implications in the context of possible attack scenarios.
Submitted 6 December, 2021; v1 submitted 2 December, 2021;
originally announced December 2021.
-
QFlow: Quantitative Information Flow for Security-Aware Hardware Design in Verilog
Authors:
Lennart M. Reimann,
Luca Hanel,
Dominik Sisejkovic,
Farhad Merchant,
Rainer Leupers
Abstract:
The enormous amount of code required to design modern hardware implementations often leads to critical vulnerabilities being overlooked. In particular, vulnerabilities that compromise the confidentiality of sensitive data, such as cryptographic keys, have a major impact on the trustworthiness of an entire system. Information flow analysis can determine whether information from sensitive signals flows towards outputs or untrusted components of the system. However, most of these analytical strategies rely on the non-interference property, stating that the untrusted targets must not be influenced by the source's data, which is shown to be too inflexible for many applications. To address this issue, there are approaches to quantify the information flow between components such that insignificant leakage can be neglected. Due to the high computational complexity of this quantification, approximations are needed, which introduce mispredictions. To tackle those limitations, we reformulate the approximations. Further, we propose a tool, QFlow, with a higher detection rate than previous tools. It can be used by inexperienced users to identify data leakages in hardware designs, thus facilitating a security-aware design process.
Submitted 22 December, 2021; v1 submitted 6 September, 2021;
originally announced September 2021.
-
Deceptive Logic Locking for Hardware Integrity Protection against Machine Learning Attacks
Authors:
Dominik Sisejkovic,
Farhad Merchant,
Lennart M. Reimann,
Rainer Leupers
Abstract:
Logic locking has emerged as a prominent key-driven technique to protect the integrity of integrated circuits. However, novel machine-learning-based attacks have recently been introduced to challenge the security foundations of locking schemes. These attacks are able to recover a significant percentage of the key without having access to an activated circuit. This paper addresses this issue through two focal points. First, we present a theoretical model to test locking schemes for key-related structural leakage that can be exploited by machine learning. Second, based on the theoretical model, we introduce D-MUX: a deceptive multiplexer-based logic-locking scheme that is resilient against structure-exploiting machine learning attacks. Through the design of D-MUX, we uncover a major fallacy in existing multiplexer-based locking schemes in the form of a structural-analysis attack. Finally, an extensive cost evaluation of D-MUX is presented. To the best of our knowledge, D-MUX is the first machine-learning-resilient locking scheme capable of protecting against all known learning-based attacks. Hereby, the presented work offers a starting point for the design and evaluation of future-generation logic locking in the era of machine learning.
Submitted 19 July, 2021;
originally announced July 2021.
-
Logic Locking at the Frontiers of Machine Learning: A Survey on Developments and Opportunities
Authors:
Dominik Sisejkovic,
Lennart M. Reimann,
Elmira Moussavi,
Farhad Merchant,
Rainer Leupers
Abstract:
In the past decade, a lot of progress has been made in the design and evaluation of logic locking, a premier technique to safeguard the integrity of integrated circuits throughout the electronics supply chain. However, the widespread proliferation of machine learning has recently introduced a new pathway to evaluating logic locking schemes. This paper summarizes the recent developments in logic locking attacks and countermeasures at the frontiers of contemporary machine learning models. Based on the presented work, the key takeaways, opportunities, and challenges are highlighted to offer recommendations for the design of next-generation logic locking.
Submitted 23 November, 2021; v1 submitted 5 July, 2021;
originally announced July 2021.
-
Brightening the Optical Flow through Posit Arithmetic
Authors:
Vinay Saxena,
Ankitha Reddy,
Jonathan Neudorfer,
John Gustafson,
Sangeeth Nambiar,
Rainer Leupers,
Farhad Merchant
Abstract:
As new technologies are invented, their commercial viability needs to be carefully examined along with their technical merits and demerits. The posit data format, proposed as a drop-in replacement for IEEE 754 float format, is one such invention that requires extensive theoretical and experimental study to identify products that can benefit from the advantages of posits for specific market segments. In this paper, we present an extensive empirical study of posit-based arithmetic vis-à-vis IEEE 754 compliant arithmetic for the optical flow estimation method called Lucas-Kanade (LuKa). First, we use SoftPosit and SoftFloat format emulators to perform an empirical error analysis of the LuKa method. Our study shows that the average error in LuKa with SoftPosit is an order of magnitude lower than LuKa with SoftFloat. We then present the integration of the hardware implementation of a posit adder and multiplier in a RISC-V open-source platform. We make several recommendations, along with the analysis of LuKa in the RISC-V context, for future generation platforms incorporating posit arithmetic units.
Submitted 17 January, 2021;
originally announced January 2021.
-
ANDROMEDA: An FPGA Based RISC-V MPSoC Exploration Framework
Authors:
Farhad Merchant,
Dominik Sisejkovic,
Lennart M. Reimann,
Kirthihan Yasotharan,
Thomas Grass,
Rainer Leupers
Abstract:
With the growing demands of consumer electronic products, the computational requirements are increasing exponentially. Due to the applications' computational needs, the computer architects are trying to pack as many cores as possible on a single die for accelerated execution of the application program codes. In a multiprocessor system-on-chip (MPSoC), striking a balance among the number of cores, memory subsystems, and network-on-chip parameters is essential to attain the desired performance. In this paper, we present ANDROMEDA, a RISC-V based framework that allows us to explore the different configurations of an MPSoC and observe the performance penalties and gains. We emulate the various configurations of MPSoC on the Synopsys HAPS-80D Dual FPGA platform. Using STREAM, matrix multiply, and N-body simulations as benchmarks, we demonstrate our framework's efficacy in quickly identifying the right parameters for efficient execution of these benchmarks.
Submitted 14 January, 2021;
originally announced January 2021.
-
An Investigation on Inherent Robustness of Posit Data Representation
Authors:
Ihsen Alouani,
Anouar Ben Khalifa,
Farhad Merchant,
Rainer Leupers
Abstract:
As the dimensions and operating voltages of computer electronics shrink to cope with consumers' demand for higher performance and lower power consumption, circuit sensitivity to soft errors increases dramatically. Recently, a new data type called posit has been proposed in the literature. Posit arithmetic has absolute advantages such as higher numerical accuracy, speed, and simpler hardware design than IEEE 754-2008 technical standard-compliant arithmetic. In this paper, we propose a comparative robustness study between 32-bit posit and 32-bit IEEE 754-2008 compliant representations. First, we propose a theoretical analysis for IEEE 754 compliant numbers and posit numbers for single and double bit flips. Then, we conduct exhaustive fault injection experiments that show a considerable inherent resilience in posit format compared to classical IEEE 754 compliant representation. To show a relevant use-case of fault-tolerant applications, we perform experiments on a set of machine-learning applications. In more than 95% of the exhaustive fault injection exploration, posit representation is less impacted by faults than the IEEE 754 compliant floating-point representation. Moreover, in 100% of the tested machine-learning applications, the accuracy of posit-implemented systems is higher than the classical floating-point-based ones.
Submitted 5 January, 2021;
originally announced January 2021.
-
Challenging the Security of Logic Locking Schemes in the Era of Deep Learning: A Neuroevolutionary Approach
Authors:
Dominik Sisejkovic,
Farhad Merchant,
Lennart M. Reimann,
Harshit Srivastava,
Ahmed Hallawa,
Rainer Leupers
Abstract:
Logic locking is a prominent technique to protect the integrity of hardware designs throughout the integrated circuit design and fabrication flow. However, in recent years, the security of locking schemes has been thoroughly challenged by the introduction of various deobfuscation attacks. As in most research branches, deep learning is being introduced in the domain of logic locking as well. Therefore, in this paper we present SnapShot: a novel attack on logic locking that is the first of its kind to utilize artificial neural networks to directly predict a key bit value from a locked synthesized gate-level netlist without using a golden reference. Hereby, the attack uses a simpler yet more flexible learning model compared to existing work. Two different approaches are evaluated. The first approach is based on a simple feedforward fully connected neural network. The second approach utilizes genetic algorithms to evolve more complex convolutional neural network architectures specialized for the given task. The attack flow offers a generic and customizable framework for attacking locking schemes using machine learning techniques. We perform an extensive evaluation of SnapShot for two realistic attack scenarios, comprising both reference benchmark circuits as well as silicon-proven RISC-V core modules. The evaluation results show that SnapShot achieves an average key prediction accuracy of 82.60% for the selected attack scenario, with a significant performance increase of 10.49 percentage points compared to the state of the art. Moreover, SnapShot outperforms the existing technique on all evaluated benchmarks. The results indicate that the security foundation of common logic locking schemes is built on questionable assumptions. The conclusions of the evaluation offer insights into the challenges of designing future logic locking schemes that are resilient to machine learning attacks.
Submitted 30 November, 2020; v1 submitted 20 November, 2020;
originally announced November 2020.
-
ExPAN(N)D: Exploring Posits for Efficient Artificial Neural Network Design in FPGA-based Systems
Authors:
Suresh Nambi,
Salim Ullah,
Aditya Lohana,
Siva Satyendra Sahoo,
Farhad Merchant,
Akash Kumar
Abstract:
The recent advances in machine learning, in general, and Artificial Neural Networks (ANN), in particular, have made smart embedded systems an attractive option for a larger number of application areas. However, the high computational complexity, memory footprints, and energy requirements of machine learning models hinder their deployment on resource-constrained embedded systems. Most state-of-the-art works have considered this problem by proposing various low bit-width data representation schemes, optimized arithmetic operators' implementations, and different complexity reduction techniques such as network pruning. To further elevate the implementation gains offered by these individual techniques, there is a need to cross-examine and combine these techniques' unique features. This paper presents ExPAN(N)D, a framework to analyze and ingather the efficacy of the Posit number representation scheme and the efficiency of fixed-point arithmetic implementations for ANNs. The Posit scheme offers a better dynamic range and higher precision for various applications than IEEE 754 single-precision floating-point format. However, due to the dynamic nature of the various fields of the Posit scheme, the corresponding arithmetic circuits have higher critical path delay and resource requirements than the single-precision-based arithmetic units. Towards this end, we propose a novel Posit to fixed-point converter for enabling high-performance and energy-efficient hardware implementations for ANNs with minimal drop in the output accuracy. We also propose a modified Posit-based representation to store the trained parameters of a network. Compared to an 8-bit fixed-point-based inference accelerator, our proposed implementation offers ≈46% and ≈18% reductions in the storage requirements of the parameters and energy consumption of the MAC units, respectively.
Submitted 27 October, 2020; v1 submitted 24 October, 2020;
originally announced October 2020.
-
CLARINET: A RISC-V Based Framework for Posit Arithmetic Empiricism
Authors:
Niraj Sharma,
Riya Jain,
Madhumita Mohan,
Sachin Patkar,
Rainer Leupers,
Nikhil Rishiyur,
Farhad Merchant
Abstract:
Many engineering and scientific applications require high precision arithmetic. IEEE 754-2008 compliant (floating-point) arithmetic is the de facto standard for performing these computations. Recently, posit arithmetic has been proposed as a drop-in replacement for floating-point arithmetic. The posit™ data representation and arithmetic claim several absolute advantages over the floating-point format and arithmetic, including higher dynamic range, better accuracy, and superior performance-area trade-offs. However, there does not exist any accessible, holistic framework that facilitates the validation of these claims of posit arithmetic, especially when the claims involve long accumulations (quire).
In this paper, we present a consolidated general-purpose processor-based framework to support posit arithmetic empiricism. The end-users of the framework have the liberty to seamlessly experiment with their applications using posit and floating-point arithmetic since the framework is designed for the two number systems to coexist. Melodica is a posit arithmetic core that implements parametric fused operations that uniquely involve the quire data type. Clarinet is a Melodica-enabled processor based on the RISC-V ISA. To the best of our knowledge, this is the first-ever integration of quire with a RISC-V core. To show the effectiveness of the Clarinet platform, we perform an extensive application study and benchmark some of the common linear algebra and computer vision kernels. We emulate Clarinet on a Xilinx FPGA and present utilization and timing data. Clarinet and Melodica remain under active development and are available as open source for posit arithmetic empiricism.
Submitted 27 October, 2021; v1 submitted 30 May, 2020;
originally announced June 2020.
-
Efficient Realization of Givens Rotation through Algorithm-Architecture Co-design for Acceleration of QR Factorization
Authors:
Farhad Merchant,
Tarun Vatwani,
Anupam Chattopadhyay,
Soumyendu Raha,
S K Nandy,
Ranjani Narayan,
Rainer Leupers
Abstract:
We present an efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better performance in terms of Gflops/watt over state-of-the-art realizations on multicore and General Purpose Graphics Processing Units (GPGPUs). GGR is an improvement over the classical Givens Rotation (GR) operation that can annihilate multiple elements of rows and columns of an input matrix simultaneously. GGR takes 33% fewer multiplications compared to GR. For custom implementation of GGR, we identify macro operations in GGR and realize them on a Reconfigurable Data-path (RDP) tightly coupled to the pipeline of a Processing Element (PE). In PE, GGR attains a speed-up of 1.1x over the Modified Householder Transform (MHT) presented in the literature. For parallel realization of GGR, we use REDEFINE, a scalable massively parallel Coarse-grained Reconfigurable Architecture, and show that the speed-up attained is commensurate with the hardware resources in REDEFINE. GGR also outperforms General Matrix Multiplication (gemm) by 10% in terms of Gflops/watt, which is counter-intuitive.
Submitted 23 March, 2018; v1 submitted 14 March, 2018;
originally announced March 2018.
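For readers unfamiliar with the baseline that the abstract above improves on, here is a minimal sketch of the classical Givens Rotation (GR), which annihilates one matrix element per rotation. It is not the GGR scheme of the paper, which annihilates several elements at once. The function names and the 2x2 example are assumptions made for illustration.

```python
import math


def givens(a, b):
    """Return (c, s, r) such that [[c, s], [-s, c]] applied to [a, b] gives [r, 0]."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0, 0.0  # both entries are zero, nothing to rotate
    return a / r, b / r, r


def apply_givens(rows, i, k, col):
    """Rotate rows i and k of `rows` in place so that rows[k][col] becomes 0."""
    c, s, _ = givens(rows[i][col], rows[k][col])
    for j in range(len(rows[i])):
        upper, lower = rows[i][j], rows[k][j]
        rows[i][j] = c * upper + s * lower
        rows[k][j] = -s * upper + c * lower


# Zero the (1, 0) entry of a small example matrix.
m = [[3.0, 1.0], [4.0, 2.0]]
apply_givens(m, 0, 1, 0)
print(m)
```

Repeating this rotation for every sub-diagonal element, column by column, yields the upper-triangular factor R of a QR factorization.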
-
Achieving Efficient Realization of Kalman Filter on CGRA through Algorithm-Architecture Co-design
Authors:
Farhad Merchant,
Tarun Vatwani,
Anupam Chattopadhyay,
Soumyendu Raha,
S K Nandy,
Ranjani Narayan
Abstract:
In this paper, we present an efficient realization of Kalman Filter (KF) that can achieve up to 65% of the theoretical peak performance of the underlying architecture platform. KF is realized using Modified Faddeeva Algorithm (MFA) as a basic building block due to its versatility, and REDEFINE Coarse Grained Reconfigurable Architecture (CGRA) is used as a platform for experiments since REDEFINE is capable of supporting realization of a set of algorithmic compute structures at run-time on a Reconfigurable Data-path (RDP). We perform several hardware and software based optimizations in the realization of KF to achieve 116% improvement in terms of Gflops over the first realization of KF. Overall, with the presented approach for KF, 4-105x performance improvement in terms of Gflops/watt over several academically and commercially available realizations of KF is attained. In REDEFINE, we show that our implementation is scalable and the performance attained is commensurate with the underlying hardware resources.
Submitted 10 February, 2018;
originally announced February 2018.
-
Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization
Authors:
Farhad Merchant,
Tarun Vatwani,
Anupam Chattopadhyay,
Soumyendu Raha,
S K Nandy,
Ranjani Narayan
Abstract:
We present an efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design where we achieve performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed for opportunities to exhibit a higher degree of parallelism, where parallelism is quantified as the number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on theoretical analysis of classical HT, an opportunity to rearrange computations in the classical HT is identified that results in Modified HT (MHT), and it is shown that MHT exhibits 1.33x higher parallelism than classical HT. Experiments on off-the-shelf multicore and GPGPUs for HT and MHT suggest that MHT is capable of achieving slightly better or equal performance compared to classical HT based QR factorization realizations in the optimized software packages for Dense Linear Algebra (DLA). We implement MHT on a customized platform for DLA and show that MHT achieves 1.3x better performance than a native implementation of classical HT on the same accelerator. For custom realization of HT and MHT based QR factorization, we also identify macro operations in the DAGs of HT and MHT that are realized on a Reconfigurable Data-path (RDP). We also observe that, due to the rearrangement of computations in MHT, custom realization of MHT is capable of achieving 12% better performance improvement over multicore and GPGPUs than the performance improvement reported by General Matrix Multiplication (GEMM) over highly tuned DLA software packages for multicore and GPGPUs, which is counter-intuitive.
Submitted 13 December, 2016;
originally announced December 2016.
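As background for the abstract above, the following is a minimal sketch of one classical Householder reflection, the building block of HT-based QR factorization. It illustrates only the textbook transform, not the Modified HT (MHT) rearrangement proposed in the paper. The function names and the example vector are assumptions made for illustration.

```python
import math


def householder_vector(x):
    """Return v such that (I - 2 v v^T / (v^T v)) x is a multiple of e1."""
    norm = math.sqrt(sum(t * t for t in x))
    v = list(x)
    # Add the norm with the sign of x[0] to avoid cancellation.
    v[0] += math.copysign(norm, x[0])
    return v


def reflect(v, x):
    """Apply the reflector defined by v to x without forming the matrix."""
    vv = sum(t * t for t in v)
    if vv == 0.0:
        return list(x)  # zero vector: the reflector is the identity
    scale = 2.0 * sum(a * b for a, b in zip(v, x)) / vv
    return [xi - scale * vi for xi, vi in zip(x, v)]


# Reflect an example column so that only its first entry is nonzero.
x = [3.0, 4.0, 0.0]
v = householder_vector(x)
print(reflect(v, x))
```

Applying such a reflector to each column in turn, restricted to the rows at and below the diagonal, produces the R factor one column at a time.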
-
Accelerating BLAS and LAPACK via Efficient Floating Point Architecture Design
Authors:
Farhad Merchant,
Anupam Chattopadhyay,
Soumyendu Raha,
S K Nandy,
Ranjani Narayan
Abstract:
Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate performance of the HPC applications. Performance in such tuned packages is attained through tuning of several algorithmic and architectural parameters such as number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, sizes of the memories in the memory hierarchy of the underlying platform, bandwidth of the memory, and structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture for performance tuning of BLAS and LAPACK. We present theoretical analysis for pipeline depth of different floating point operations like multiplier, adder, square root, and divider followed by characterization of BLAS and LAPACK to determine several parameters required in the theoretical framework for deciding optimum pipeline depth of the floating point operations. A simple design of a Processing Element (PE) is presented, and the PE is shown to outperform the most recent custom realizations of BLAS and LAPACK by 1.1X to 1.5X in Gflops/W, and 1.9X to 2.1X in Gflops/mm^2.
Submitted 13 November, 2017; v1 submitted 27 October, 2016;
originally announced October 2016.
-
Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design
Authors:
Farhad Merchant,
Tarun Vatwani,
Anupam Chattopadhyay,
Soumyendu Raha,
S K Nandy,
Ranjani Narayan
Abstract:
Basic Linear Algebra Subprograms (BLAS) play a key role in high performance and scientific computing applications. Experimentally, yesteryear multicore and General Purpose Graphics Processing Units (GPGPUs) are capable of achieving up to 15 to 57% of the theoretical peak performance at 65W to 240W respectively for compute bound operations like Double/Single Precision General Matrix Multiplication (XGEMM). For bandwidth bound operations like Single/Double precision Matrix-vector Multiplication (XGEMV) the performance is merely 5 to 7% of the theoretical peak performance in multicores and GPGPUs respectively. Achieving performance in BLAS requires moving away from conventional wisdom and evolving towards a customized accelerator tailored for BLAS through algorithm-architecture co-design. In this paper, we present acceleration of Level-1 (vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations) BLAS through algorithm-architecture co-design on a Coarse-grained Reconfigurable Architecture (CGRA). We choose REDEFINE CGRA as a platform for our experiments since REDEFINE can be adapted to support a domain of interest through tailor-made Custom Function Units (CFUs). For efficient sequential realization of BLAS, we present the design of a Processing Element (PE) and perform micro-architectural enhancements in the PE to achieve up to 74% of the theoretical peak performance of PE in DGEMM, 40% in DGEMV and 20% in double precision inner product (DDOT). We attach this PE to REDEFINE CGRA as a CFU and show the scalability of our solution. Finally, we show performance improvement of 3-140x in PE over commercially available Intel micro-architectures, ClearSpeed CSX700, FPGA, and Nvidia GPGPUs.
Submitted 27 November, 2016; v1 submitted 20 October, 2016;
originally announced October 2016.