
Towards Power Efficient DNN Accelerator Design on Reconfigurable Platform

Authors: Rourab Paul, Sreetama Sarkar, Suman Sau, Koushik Chakraborty, Sanghamitra Roy, Amlan Chakrabarti

Abstract: The rapid emergence of Field Programmable Gate Arrays (FPGAs) has accelerated research on hardware implementations of Deep Neural Networks (DNNs). Among DNN processors, domain-specific architectures such as Google's Tensor Processing Unit (TPU) have outperformed conventional GPUs. However, implementations of TPUs in reconfigurable hardware should emphasize energy savings to meet green computing requirements. Voltage scaling, a popular approach to energy savings, is critical in FPGAs because it may cause timing failures if not done appropriately. In this work, we present an ultra-low-power FPGA implementation of a TPU for edge applications. We divide the systolic array of a TPU into different FPGA partitions, where each partition uses a different near-threshold computing (NTC) biasing voltage to run its FPGA cores. The biasing voltage for each partition is roughly calculated by the proposed static schemes and then further calibrated by the proposed runtime scheme. Four clustering algorithms, based on the minimum slack value of the design paths of the Multiply-Accumulate units (MACs), are studied for partitioning the FPGA. To overcome the timing failures caused by NTC, MACs with higher minimum slack are placed in lower-voltage partitions and MACs with lower minimum slack are placed in higher-voltage partitions. The proposed architecture is simulated on a commercial platform, Vivado with a Xilinx Artix-7 FPGA, and on the academic platform VTR with 22nm, 45nm, and 130nm FPGAs. The simulation results substantiate the implementation of a voltage-scaled TPU on FPGAs and also justify its power efficiency.

Submitted 14 February, 2022; v1 submitted 13 February, 2021; originally announced February 2021.

Comments: Manuscript
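
The slack-based placement described in the abstract above (higher minimum slack goes to a lower-voltage partition) can be illustrated with a minimal sketch. The paper evaluates four clustering algorithms that are not detailed in the abstract; the equal-size rank binning below is a stand-in chosen only for illustration. The function name `partition_macs_by_slack`, the MAC labels, and the voltage values are all hypothetical.

```python
from typing import Dict, List


def partition_macs_by_slack(
    min_slack_ns: Dict[str, float], voltages: List[float]
) -> Dict[str, float]:
    """Map each MAC to a partition voltage: more slack -> lower voltage.

    Assumption: equal-size rank bins stand in for the paper's clustering
    algorithms, which the abstract does not specify.
    """
    if not voltages:
        raise ValueError("need at least one voltage level")
    # Highest slack first, so the first bin can take the lowest voltage.
    ordered = sorted(min_slack_ns, key=lambda mac: min_slack_ns[mac], reverse=True)
    levels = sorted(voltages)  # ascending: lowest voltage first
    n, k = len(ordered), len(levels)
    assignment: Dict[str, float] = {}
    for rank, mac in enumerate(ordered):
        bin_index = rank * k // n  # always in 0 .. k-1
        assignment[mac] = levels[bin_index]
    return assignment


if __name__ == "__main__":
    # Hypothetical minimum slack per MAC (ns) and two hypothetical voltages (V).
    slack = {"mac0": 0.9, "mac1": 0.1, "mac2": 0.5, "mac3": 0.3}
    print(partition_macs_by_slack(slack, [0.8, 0.6]))
```

In a real flow the slack values would come from a timing report, and the static assignment would only be a starting point for the runtime calibration the abstract mentions.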

arXiv:1908.11538 [pdf, other]

IoT based Smart Access Controlled Secure Smart City Architecture Using Blockchain

Authors: Rourab Paul, Nimisha Ghosh, Suman Sau, Amlan Chakrabarti, Prasant Mahapatra

Abstract: Standard security protocols such as SSL, TLS, and IPSec have high memory and processor consumption, which makes them unsuitable for resource-constrained platforms such as the Internet of Things (IoT). Blockchain (BC) finds efficient application in IoT platforms to preserve the five basic cryptographic primitives: confidentiality, authenticity, integrity, availability, and non-repudiation. Conventional adoption of BC in IoT platforms causes high energy consumption, delay, and computational overhead, which are not appropriate for various resource-constrained IoT devices. This work proposes a machine learning (ML) based smart access control framework in a public and a private BC for a smart city application, which makes it more efficient than existing IoT applications. The proposed IoT-based smart city architecture adopts BC technology to address the cryptographic security and privacy issues. Moreover, BC adds only minimal overhead on the IoT platform. This work investigates the existing threat models and critical access control issues which handle multiple permissions of various nodes, and detects relevant inconsistencies to notify the corresponding nodes. Comparison with existing literature in terms of all security issues shows that the proposed architecture is competitively efficient in terms of security access control.

Submitted 9 September, 2019; v1 submitted 30 August, 2019; originally announced August 2019.

Comments: Manuscript

arXiv:1509.06891 [pdf, ps, other]

A Novel Method for Soft Error Mitigation in FPGA using Adaptive Cross Parity Code

Authors: Swagata Mandal, Rourab Paul, Suman Sau, Amlan Chakrabarti, Subhasis Chattopadhyay

Abstract: Field Programmable Gate Arrays (FPGAs) are more prone to transient faults in the presence of radiation and other environmental hazards than Application Specific Integrated Circuits (ASICs). Hence, error mitigation and recovery techniques are necessary to protect FPGA hardware from soft errors arising from such transient faults. In this paper, a new efficient multi-bit error correcting method for FPGAs is proposed using an adaptive cross parity check (ACPC) code. ACPC is easy to implement, and the required decoding circuit is also simple. In the proposed scheme, the total configuration memory is partitioned into two parts. One part contains the ACPC hardware, which is static and assumed to be unaffected by any kind of error. The other part stores the binary file for the logic that is to be protected from transient errors and is assumed to be dynamically reconfigurable (the partially reconfigurable area). The binary file from secondary memory passes through the ACPC hardware, and the bits for the forward error correction (FEC) field are calculated before it enters the reconfigurable portion. At runtime, the data from the dynamically reconfigurable portion of the configuration memory is read back and passed through the ACPC hardware, which corrects the errors before the data enters the dynamic configuration memory. We propose a first-of-its-kind methodology for transient fault correction using ACPC code for FPGAs. To validate the design, we have tested the proposed methodology on a Kintex FPGA. We have also measured parameters such as critical path, power consumption, resource overhead, and error correction efficiency to estimate the performance of the proposed method.

Submitted 23 September, 2015; originally announced September 2015.

Comments: Manuscript

arXiv:1507.01777 [pdf, ps, other]

FPGA based Novel High Speed DAQ System Design with Error Correction

Authors: Swagata Mandal, Suman Sau, Amlan Chakrabarti, Jogendra Saini, Sushanta Kumar Pal, Subhasish Chattopadhyay

Abstract: Present state-of-the-art applications in high energy physics (HEP) experiments, radar communication, satellite communication, and biomedical instrumentation require fault-resilient data acquisition (DAQ) systems with data rates on the order of Gbps. To keep a high-speed DAQ system functional in a radiation environment where direct human intervention is not possible, a robust and error-free communication system is necessary. In this work we present an efficient DAQ design and its implementation on a field programmable gate array (FPGA). The proposed DAQ system supports high-speed data communication (~4.8 Gbps) and achieves multi-bit error correction. BCH code (named after Raj Bose, D. K. Ray-Chaudhuri, and Alexis Hocquenghem) is used for multi-bit error correction. The design has been implemented on a Xilinx Kintex-7 board and tested for board-to-board communication as well as board-to-PC communication using the PCIe (Peripheral Component Interconnect Express) interface. To the best of our knowledge, the proposed FPGA-based high-speed DAQ system utilizing an optical link and multi-bit error resiliency is the first of its kind. Performance of the implemented DAQ system is estimated in terms of resource utilization, critical path delay, efficiency, and bit error rate (BER).

Submitted 7 July, 2015; originally announced July 2015.

Comments: ISVLSI 2015. arXiv admin note: substantial text overlap with arXiv:1505.04569, arXiv:1503.08819

Report number: 01A

arXiv:1505.04569 [pdf, ps, other]

doi 10.1088/1742-6596/664/8/082049

High speed fault tolerant secure communication for muon chamber using FPGA based GBT emulator

Authors: Suman Sau, Swagata Mandal, Jogender Saini, Amlan Chakrabarti, Subhasis Chattopadhyay

Abstract: The Compressed Baryonic Matter (CBM) experiment is part of the Facility for Antiproton and Ion Research (FAIR) at GSI in Darmstadt. The CBM experiment will investigate highly compressed nuclear matter using nucleus-nucleus collisions. It will examine heavy-ion collisions in fixed-target geometry and will be able to measure hadrons, electrons, and muons. CBM requires precise time synchronization, compact hardware, radiation tolerance, self-triggered front-end electronics, efficient data aggregation schemes, and the capability to handle high data rates (up to several TB/s). As part of the implementation of the readout chain of the Muon Chamber (MUCH) in India, we have tried to implement an FPGA-based emulator of GBTx. GBTx is a radiation-tolerant ASIC, developed by CERN, that can be used to implement multipurpose high-speed bidirectional optical links for high energy physics (HEP) experiments. GBTx will be used in a highly irradiated area and is therefore more prone to multi-bit errors. To mitigate this effect, instead of the single-bit error correcting RS code we have used a two-bit error correcting (15, 7) BCH code. This increases the redundancy, which in turn increases the reliability of the coded data, so the coded data will be less affected by radiation-induced noise. Data travels from the detector to the PC through multiple nodes of the communication channel. To make the data communication secure, the Advanced Encryption Standard (AES, a symmetric-key cipher) and RSA (an asymmetric-key cipher) are used after the channel coding.

Submitted 18 May, 2015; originally announced May 2015.
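
The two-bit error correcting (15, 7) BCH code named in the abstract above can be sketched in software. This is not the authors' FPGA implementation: it assumes the standard generator polynomial g(x) = x^8 + x^7 + x^6 + x^4 + 1, systematic encoding, and a brute-force syndrome lookup table, and the function names are hypothetical.

```python
from itertools import combinations

N, K = 15, 7                 # codeword length and message length in bits
GEN = 0b111010001            # g(x) = x^8 + x^7 + x^6 + x^4 + 1 (assumed standard choice)
PARITY_BITS = N - K          # 8 parity bits


def poly_mod(value: int) -> int:
    """Remainder of value(x) divided by g(x), with coefficients in GF(2)."""
    gen_len = GEN.bit_length()
    for shift in range(value.bit_length() - gen_len, -1, -1):
        if value & (1 << (shift + gen_len - 1)):
            value ^= GEN << shift
    return value


def encode(message: int) -> int:
    """Systematic encoding: 7 message bits followed by 8 parity bits."""
    if not 0 <= message < (1 << K):
        raise ValueError("message must fit in 7 bits")
    shifted = message << PARITY_BITS
    return shifted | poly_mod(shifted)


# Syndrome table for every error pattern of weight 0, 1, or 2.
# The code has minimum distance 5, so these syndromes are all distinct.
SYNDROME_TO_ERROR = {0: 0}
for weight in (1, 2):
    for positions in combinations(range(N), weight):
        error = sum(1 << p for p in positions)
        SYNDROME_TO_ERROR[poly_mod(error)] = error


def decode(received: int) -> int:
    """Correct up to two flipped bits and return the 7-bit message."""
    error = SYNDROME_TO_ERROR.get(poly_mod(received))
    if error is None:
        raise ValueError("more than two bit errors detected")
    return (received ^ error) >> PARITY_BITS


if __name__ == "__main__":
    codeword = encode(0b1011001)
    corrupted = codeword ^ 0b000010000000100  # flip two bits
    print(decode(corrupted) == 0b1011001)
```

A table lookup is practical here because only 121 error patterns need to be covered; hardware decoders typically compute the error locations algebraically instead.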

arXiv:1503.08819 [pdf, ps, other]

FPGA based High Speed Data Acquisition System for High Energy Physics Application

Authors: Swagata Mandal, Suman Sau, Amlan Chakrabarti, Subhasis Chattopadhyay

Abstract: In high energy physics (HEP) experiments, high-speed and fault-resilient data communication is needed between detectors/sensors and the host PC. Transient faults can occur in the communication hardware due to external effects such as charged particles, environmental noise, or radiation in HEP experiments, leading to single or multiple bit errors. To keep the communication system functional in a radiation environment where direct human intervention is not possible, a high-speed data acquisition (DAQ) architecture that supports error recovery is necessary. This design presents an efficient implementation of a field programmable gate array (FPGA) based high-speed DAQ system with an optical communication link supported by a multi-bit error correcting model. The design has been implemented on a Xilinx Kintex-7 board and tested for board-to-board communication as well as PC communication using PCIe (Peripheral Component Interconnect Express). Data communication speeds of up to 4.8 Gbps have been achieved in board-to-board and board-to-PC communication, and resource utilization and critical path delay are also estimated.

Submitted 30 March, 2015; originally announced March 2015.

arXiv:1212.6303 [pdf]

A brief experience on journey through hardware developments for image processing and its applications on Cryptography

Authors: Sangeet Saha, Chandrajit Pal, Rourab Paul, Satyabrata Maity, Suman Sau

Abstract: Embedded applications in the image and video processing, communication, and cryptography domains have been taking a larger space in the current research era. Improvement of pictorial information for better human perception, such as deblurring and de-noising in fields such as satellite imaging and medical imaging, is a renewed research thrust. Specifically, we would like to elaborate on our experience of the significance of computer vision as one of the domains where hardware-implemented algorithms perform far better than those implemented in software. So far, embedded design engineers have successfully implemented their designs by means of Application Specific Integrated Circuits (ASICs) and/or Digital Signal Processors (DSPs). However, with the advancement of VLSI technology, a very powerful hardware device, the Field Programmable Gate Array (FPGA), was developed; it combines the key advantages of ASICs and DSPs and can be reprogrammed, making it a very attractive device for rapid prototyping. Communication of image and video data among multiple FPGAs calls for secured transmission, so the relevance of cryptography is unavoidable. This paper shows how the Xilinx hardware development platform as well as MathWorks MATLAB can be used to develop hardware-based computer vision algorithms and the corresponding crypto transmission channel between multiple FPGA platforms from a system-level approach, making it favourable for developing a hardware-software co-design environment.

Submitted 27 December, 2012; originally announced December 2012.

Comments: In the proceedings of the 100th Indian Science Congress, 03-07 January, Kolkata

arXiv:1206.1567 [pdf]

Architecture for real time continuous sorting on large width data volume for FPGA based applications

Authors: Rourab Paul, Suman Sau, Amlan Chakrabarti

Abstract: In engineering applications, sorting is an important and widely studied problem where execution speed and the resources used for computation are of extreme importance, especially for real-time data processing. Most traditional sorting techniques begin processing only after receiving all of the data and hence need a large amount of resources for data storage. A suitable design strategy is therefore needed to sort a large amount of data in real time, which in most cases essentially means faster execution and fewer resources. This paper proposes a single-chip scalable architecture based on a Field Programmable Gate Array (FPGA) for a modified counting sort algorithm in which data acquisition and sorting are done in real time. Our design promises to work efficiently: data can be accepted at runtime without any prior storage, and the execution speed of our algorithm is invariant to the length of the data stream. The proposed design is implemented and verified on a Spartan 3E (XC3S500E-FG320) FPGA system. The results show that our design is better than existing research works in terms of some of the design parameters.

Submitted 7 June, 2012; originally announced June 2012.

Comments: 5 pages, RASTM, 2011 INDORE
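
The idea in the abstract above of accepting data at runtime without storing the stream can be illustrated with plain counting sort, where each arriving sample only updates a histogram bin. This is a minimal software sketch, not the authors' modified algorithm or their hardware architecture; the class name `StreamingCountingSort` and the `key_bits` parameter are hypothetical.

```python
from typing import Iterator, List


class StreamingCountingSort:
    """Counting sort that consumes samples one at a time.

    Memory depends on the key width (2**key_bits counters), not on the
    length of the input stream.
    """

    def __init__(self, key_bits: int) -> None:
        if key_bits <= 0:
            raise ValueError("key_bits must be positive")
        self.counts: List[int] = [0] * (1 << key_bits)

    def push(self, value: int) -> None:
        """Accept one sample: a single counter increment, no buffering."""
        if not 0 <= value < len(self.counts):
            raise ValueError("value out of range for the configured key width")
        self.counts[value] += 1

    def sorted_values(self) -> Iterator[int]:
        """Emit all samples seen so far in ascending order."""
        for value, count in enumerate(self.counts):
            for _ in range(count):
                yield value


if __name__ == "__main__":
    sorter = StreamingCountingSort(key_bits=3)
    for sample in [5, 1, 3, 1, 0]:
        sorter.push(sample)
    print(list(sorter.sorted_values()))
```

Each `push` costs one counter update regardless of how many samples have already arrived, which is the property that makes this family of algorithms attractive for streaming hardware.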

arXiv:1205.2153 [pdf]

Design and implementation of real time AES-128 on real time operating system for multiple FPGA communication

Authors: Rourab Paul, Sangeet Saha, Suman Sau, Amlan Chakrabarti

Abstract: Security is the most important part of a data communication system, where more randomization in secret keys increases both the security and the complexity of cryptographic algorithms. As a result, these algorithms now demand large amounts of memory and long execution times on hardware platforms. Field programmable gate arrays (FPGAs) provide one of the major alternatives among hardware platforms due to their reconfigurable nature, low price, and fast time to market. In an FPGA-based embedded system, an embedded processor can execute a particular algorithm under a real-time operating system (RTOS), where threads may reduce resource utilization and time consumption. A process at runtime is separated into smaller tasks, which are executed by the RTOS scheduler to meet the real-time deadline. In this paper we demonstrate the design and implementation of both encryption and decryption for the 128-bit Advanced Encryption Standard (AES), a symmetric-key algorithm, by developing suitable hardware and software designs on a Xilinx Spartan-3E (XC3S500E-FG320) device using the Xilkernel RTOS. The implementation has been tested successfully. The system is optimized in terms of execution speed and hardware utilization.

Submitted 10 May, 2012; originally announced May 2012.

Comments: 6 pages, IEMCON 12, Kolkata

Showing 1–9 of 9 results for author: Sau, S