Search | arXiv e-print repository

SpikePipe: Accelerated Training of Spiking Neural Networks via Inter-Layer Pipelining and Multiprocessor Scheduling

Authors: Sai Sanjeet, Bibhu Datta Sahoo, Keshab K. Parhi

Abstract: Spiking Neural Networks (SNNs) have gained popularity due to their high energy efficiency. Prior works have proposed various methods for training SNNs, including backpropagation-based methods. Training SNNs is computationally expensive compared to their conventional counterparts and would benefit from multiprocessor hardware acceleration. This is the first paper to propose inter-layer pipelining to accelerate training in SNNs using systolic array-based processors and multiprocessor scheduling. The impact of training using delayed gradients is observed using three networks training on different datasets, showing no degradation for small networks and < 10% degradation for large networks. The mapping of various training tasks of the SNN onto systolic arrays is formulated, and the proposed scheduling method is evaluated on the three networks. The results are compared against standard pipelining algorithms. The results show that the proposed method achieves an average speedup of 1.6X compared to standard pipelining algorithms, with an upwards of 2X improvement in some cases. The incurred communication overhead due to the proposed method is less than 0.5% of the total required communication of training.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2309.12373

doi 10.1109/TCSI.2024.3384436

Systematic Design and Optimization of Quantum Circuits for Stabilizer Codes

Authors: Arijit Mondal, Keshab K. Parhi

Abstract: Quantum computing is an emerging technology that has the potential to achieve exponential speedups over their classical counterparts. To achieve quantum advantage, quantum principles are being applied to fields such as communications, information processing, and artificial intelligence. However, quantum computers face a fundamental issue since quantum bits are extremely noisy and prone to decoherence. Keeping qubits error free is one of the most important steps towards reliable quantum computing. Different stabilizer codes for quantum error correction have been proposed in past decades and several methods have been proposed to import classical error correcting codes to the quantum domain. However, formal approaches towards the design and optimization of circuits for these quantum encoders and decoders have so far not been proposed. In this paper, we propose a formal algorithm for systematic construction of encoding circuits for general stabilizer codes. This algorithm is used to design encoding and decoding circuits for an eight-qubit code. Next, we propose a systematic method for the optimization of the encoder circuit thus designed. Using the proposed method, we optimize the encoding circuit in terms of the number of 2-qubit gates used. The proposed optimized eight-qubit encoder uses 18 CNOT gates and 4 Hadamard gates, as compared to 14 single qubit gates, 33 2-qubit gates, and 6 CCNOT gates in a prior work. The encoder and decoder circuits are verified using IBM Qiskit. We also present optimized encoder circuits for Steane code and a 13-qubit code in terms of the number of gates used.

Submitted 20 September, 2023; originally announced September 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2309.11793

Journal ref: IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 72, 2024

arXiv:2309.11793

doi 10.1109/MCAS.2024.3349668

Quantum Circuits for Stabilizer Error Correcting Codes: A Tutorial

Authors: Arijit Mondal, Keshab K. Parhi

Abstract: Quantum computers have the potential to provide exponential speedups over their classical counterparts. Quantum principles are being applied to fields such as communications, information processing, and artificial intelligence to achieve quantum advantage. However, quantum bits are extremely noisy and prone to decoherence. Thus, keeping the qubits error free is extremely important toward reliable quantum computing. Quantum error correcting codes have been studied for several decades and methods have been proposed to import classical error correcting codes to the quantum domain. However, circuits for such encoders and decoders haven't been explored in depth. This paper serves as a tutorial on designing and simulating quantum encoder and decoder circuits for stabilizer codes. We present encoding and decoding circuits for five-qubit code and Steane code, along with verification of these circuits using IBM Qiskit. We also provide nearest neighbour compliant encoder and decoder circuits for the five-qubit code.

Submitted 21 September, 2023; originally announced September 2023.

Journal ref: IEEE Circuits and Systems Magazine, 24(1), pp. 33-51, 2024

arXiv:2309.09035

doi 10.1109/ICASSP48485.2024.10447370

A Low-Latency FFT-IFFT Cascade Architecture

Authors: Keshab K. Parhi

Abstract: This paper addresses the design of a partly-parallel cascaded FFT-IFFT architecture that does not require any intermediate buffer. Folding can be used to design partly-parallel architectures for FFT and IFFT. While many cascaded FFT-IFFT architectures can be designed using various folding sets for the FFT and the IFFT, for a specified folded FFT architecture, there exists a unique folding set to design the IFFT architecture that does not require an intermediate buffer. Such a folding set is designed by processing the output of the FFT as soon as possible (ASAP) in the folded IFFT. Elimination of the intermediate buffer reduces latency and saves area. The proposed approach is also extended to interleaved processing of multi-channel time-series. The proposed FFT-IFFT cascade architecture saves about N/2 memory elements and N/4 clock cycles of latency compared to a design with identical folding sets. For the 2-interleaved FFT-IFFT cascade, the memory and latency savings are, respectively, N/2 units and N/2 clock cycles, compared to a design with identical folding sets.

Submitted 16 September, 2023; originally announced September 2023.

Journal ref: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, April 2024

arXiv:2306.12519

doi 10.1109/MSP.2024.3368239

Long Polynomial Modular Multiplication using Low-Complexity Number Theoretic Transform

Authors: Sin-Wei Chiu, Keshab K. Parhi

Abstract: This tutorial aims to establish connections between polynomial modular multiplication over a ring to circular convolution and discrete Fourier transform (DFT). The main goal is to extend the well-known theory of DFT in signal processing (SP) to other applications involving polynomials in a ring such as homomorphic encryption (HE). HE allows any third party to operate on the encrypted data without decrypting it in advance. Since most HE schemes are constructed from the ring-learning with errors (R-LWE) problem, efficient polynomial modular multiplication implementation becomes critical. Any improvement in the execution of these building blocks would have significant consequences for the global performance of HE. This lecture note describes three approaches to implementing long polynomial modular multiplication using the number theoretic transform (NTT): zero-padded convolution, without zero-padding, also referred to as negative wrapped convolution (NWC), and low-complexity NWC (LC-NWC).

Submitted 22 December, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

Comments: 10 pages

Journal ref: IEEE Signal Processing Magazine, 41(1), pp. 92-102, Jan. 2024
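
Illustrative note on the entry above: the negative wrapped convolution (NWC) named in the abstract can be sketched as follows. This is a minimal sketch, not code from the paper. It multiplies two polynomials modulo x^n + 1 with coefficients modulo q, once by schoolbook multiplication and once through a transform with pre- and post-scaling by powers of a primitive 2n-th root of unity. The parameters n = 4, q = 17, psi = 2 and all function names are assumptions chosen for illustration, and the transform is a naive O(n^2) evaluation rather than a fast butterfly NTT.

```python
def schoolbook_negacyclic(a, b, q):
    """Multiply a(x) * b(x) mod (x^n + 1, q) directly; terms that wrap past x^(n-1) change sign."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c


def naive_ntt(x, omega, q):
    """Naive O(n^2) number theoretic transform: X[i] = sum_j x[j] * omega^(i*j) mod q."""
    n = len(x)
    return [sum(x[j] * pow(omega, i * j, q) for j in range(n)) % q for i in range(n)]


def nwc_multiply(a, b, psi, q):
    """Negative wrapped convolution: psi is assumed to be a primitive 2n-th root of unity mod q."""
    n = len(a)
    omega = psi * psi % q  # primitive n-th root of unity
    # Pre-scale coefficient i by psi^i so the cyclic convolution becomes negacyclic.
    a_s = [a[i] * pow(psi, i, q) % q for i in range(n)]
    b_s = [b[i] * pow(psi, i, q) % q for i in range(n)]
    # Pointwise product in the transform domain.
    prod = [x * y % q for x, y in zip(naive_ntt(a_s, omega, q), naive_ntt(b_s, omega, q))]
    # Inverse transform: use omega^-1 and scale by n^-1 (pow with -1 needs Python 3.8+).
    n_inv = pow(n, -1, q)
    c_s = [v * n_inv % q for v in naive_ntt(prod, pow(omega, -1, q), q)]
    # Post-scale coefficient i by psi^-i to undo the pre-scaling.
    psi_inv = pow(psi, -1, q)
    return [c_s[i] * pow(psi_inv, i, q) % q for i in range(n)]


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(nwc_multiply(a, b, psi=2, q=17) == schoolbook_negacyclic(a, b, 17))
```

The pre-scaling works because psi^n = -1 modulo q, so a term that wraps around in the cyclic convolution picks up exactly the sign required by reduction modulo x^n + 1, and no zero-padding to length 2n is needed.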

arXiv:2202.09623

doi 10.1109/ISCAS48785.2022.9937347

Multi-Channel FFT Architectures Designed via Folding and Interleaving

Authors: Nanda K. Unnikrishnan, Keshab K. Parhi

Abstract: Computing the FFT of a single channel is well understood in the literature. However, computing the FFT of multiple channels in a systematic manner has not been fully addressed. This paper presents a framework to design a family of multi-channel FFT architectures using folding and interleaving. Three distinct multi-channel FFT architectures are presented in this paper. These architectures differ in the input and output preprocessing steps and are based on different folding sets, i.e., different orders of execution.

Submitted 19 February, 2022; originally announced February 2022.

Comments: Proc. 2022 IEEE International Symposium on Circuits and Systems (ISCAS)

Journal ref: Proc. 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 142-146

arXiv:2108.06629

doi 10.1109/ICCAD51958.2021.9643567

LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and Inter-Layer Gradient Pipelining and Multiprocessor Scheduling

Authors: Nanda K. Unnikrishnan, Keshab K. Parhi

Abstract: The time required for training the neural networks increases with size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of the tasks within the layer and between consecutive layers. Prior approaches, such as PipeDream, have exploited the use of delayed gradient to achieve inter-layer pipelining. However, these approaches treat the entire backpropagation as a single task; this leads to an increase in computation time and processor underutilization. This paper presents novel optimization approaches where the gradient computations with respect to the weights and the activation functions are considered independently; therefore, these can be computed in parallel. This is referred to as intra-layer optimization. Additionally, the gradient computation with respect to the activation function is further divided into two parts and distributed to two consecutive layers. This leads to balanced scheduling where the computation time of each layer is the same. This is referred to as inter-layer optimization. The proposed system, referred to as LayerPipe, reduces the number of clock cycles required for training while maximizing processor utilization with minimal inter-processor communication overhead. LayerPipe achieves an average speedup of 25% and upwards of 80% with 7 to 9 processors with less communication overhead when compared to PipeDream.

Submitted 14 August, 2021; originally announced August 2021.

Comments: Proc. of the 2021 IEEE International Conference on Computer Aided Design (ICCAD)

Journal ref: 2021 IEEE/ACM Conference on Computer Aided Design (ICCAD)

arXiv:2102.00561

doi 10.1109/MSP.2021.3052487

Teaching Digital Signal Processing by Partial Flipping, Active Learning and Visualization

Authors: Keshab K. Parhi

Abstract: Effectiveness of teaching digital signal processing can be enhanced by reducing lecture time devoted to theory, and increasing emphasis on applications, programming aspects, visualization and intuitive understanding. An integrated approach to teaching requires instructors to simultaneously teach theory and its applications in storage and processing of audio, speech and biomedical signals. Student engagement can be enhanced by engaging students to work in groups during the class where students can solve short problems and short programming assignments or take quizzes. These approaches will increase student interest in learning the subject and student engagement.

Submitted 31 January, 2021; originally announced February 2021.

Comments: IEEE Signal Processing Magazine, 38(3), 2021

Journal ref: IEEE Signal Processing Magazine, 38(3), pp. 20-29, May 2021

arXiv:2004.11204

doi 10.1109/MCAS.2020.2988388

Classification using Hyperdimensional Computing: A Review

Authors: Lulu Ge, Keshab K. Parhi

Abstract: Hyperdimensional (HD) computing is built upon its unique data type referred to as hypervectors. The dimension of these hypervectors is typically in the range of tens of thousands. Proposed to solve cognitive tasks, HD computing aims at calculating similarity among its data. Data transformation is realized by three operations, including addition, multiplication and permutation. Its ultra-wide data representation introduces redundancy against noise. Since information is evenly distributed over every bit of the hypervectors, HD computing is inherently robust. Additionally, due to the nature of those three operations, HD computing leads to fast learning ability, high energy efficiency and acceptable accuracy in learning and classification tasks. This paper introduces the background of HD computing, and reviews the data representation, data transformation, and similarity measurement. The orthogonality in high dimensions presents opportunities for flexible computing. To balance the tradeoff between accuracy and efficiency, strategies include but are not limited to encoding, retraining, binarization and hardware acceleration. Evaluations indicate that HD computing shows great potential in addressing problems using data in the form of letters, signals and images. HD computing especially shows significant promise to replace machine learning algorithms as a light-weight classifier in the field of internet of things (IoTs).

Submitted 19 April, 2020; originally announced April 2020.

Comments: IEEE Circuits and Systems Magazine (2020)

Journal ref: IEEE Circuits and Systems Magazine, 20(2), pp. 30-47, June 2020
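
Illustrative note on the entry above: the three hypervector operations named in the abstract (addition, multiplication, permutation) and the similarity measurement can be sketched as below. This is a minimal sketch under stated assumptions, not code from the review: it assumes bipolar hypervectors with entries in {-1, +1}, a dimension of 10,000, majority-sign addition, and a normalized dot product as the similarity measure. All function names are hypothetical.

```python
import random

D = 10_000  # assumed dimension, in the "tens of thousands" range the abstract mentions


def random_hv(rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(D)]


def bind(a, b):
    """Multiplication: element-wise product; for bipolar vectors it is its own inverse."""
    return [x * y for x, y in zip(a, b)]


def bundle(*hvs):
    """Addition: element-wise sum followed by a sign threshold (ties map to +1)."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]


def permute(a, shift=1):
    """Permutation: cyclic shift of the entries by `shift` positions."""
    return a[-shift:] + a[:-shift]


def similarity(a, b):
    """Normalized dot product; equals cosine similarity for bipolar vectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
a, b, c = random_hv(rng), random_hv(rng), random_hv(rng)
s = bundle(a, b, c)

print("random pair       :", round(similarity(a, b), 3))            # close to 0
print("bundle vs member  :", round(similarity(s, a), 3))            # clearly positive
print("permuted vs source:", round(similarity(permute(a), a), 3))   # close to 0
print("unbinding exact   :", bind(bind(a, b), b) == a)
```

Independent random hypervectors are nearly orthogonal at this dimension, which is why a bundled vector remains recognizably similar to each of its members while a bound or permuted vector looks unrelated to its inputs.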

arXiv:2002.05529

doi 10.1109/ISCAS45731.2020.9181242

A Gradient-Interleaved Scheduler for Energy-Efficient Backpropagation for Training Neural Networks

Authors: Nanda Unnikrishnan, Keshab K. Parhi

Abstract: This paper addresses design of accelerators using systolic architectures for training of neural networks using a novel gradient interleaving approach. Training the neural network involves backpropagation of error and computation of gradients with respect to the activation functions and weights. It is shown that the gradient with respect to the activation function can be computed using a weight-stationary systolic array while the gradient with respect to the weights can be computed using an output-stationary systolic array. The novelty of the proposed approach lies in interleaving the computations of these two gradients to the same configurable systolic array. This results in reuse of the variables from one computation to the other and eliminates unnecessary memory accesses. The proposed approach leads to 1.4 - 2.2 times savings in terms of number of cycles and 1.9× savings in terms of memory accesses. Thus, the proposed accelerator reduces latency and energy consumption.

Submitted 12 February, 2020; originally announced February 2020.

Comments: Proc. 2020 IEEE International Symposium on Circuits and Systems (ISCAS)

Journal ref: IEEE Trans. Circuits and Systems, Part-I

Showing 1–10 of 10 results for author: Parhi, K K