Showing 1–12 of 12 results for author: Kalamkar, D

  1. arXiv:2304.12576

    cs.DC cs.AI

    Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures

    Authors: Evangelos Georganas, Dhiraj Kalamkar, Kirill Voronin, Abhisek Kundu, Antonio Noack, Hans Pabst, Alexander Breuer, Alexander Heinecke

    Abstract: During the past decade, Deep Learning (DL) algorithms, programming systems and hardware have converged with the High Performance Computing (HPC) counterparts. Nevertheless, the programming methodology of DL and HPC systems is stagnant, relying on highly-optimized, yet platform-specific and inflexible vendor-optimized libraries. Such libraries provide close-to-peak performance on specific platforms…

    Submitted 15 March, 2024; v1 submitted 25 April, 2023; originally announced April 2023.

  2. arXiv:2104.08002

    cs.LG cs.AI cs.DC

    Efficient and Generic 1D Dilated Convolution Layer for Deep Learning

    Authors: Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman, Bharat Kaul

    Abstract: Convolutional neural networks (CNNs) have found many applications in tasks involving two-dimensional (2D) data, such as image classification and image processing. Therefore, 2D convolution layers have been heavily optimized on CPUs and GPUs. However, in many applications - for example genomics and speech recognition, the data can be one-dimensional (1D). Such applications can benefit from optimize…

    Submitted 16 April, 2021; originally announced April 2021.

  3. arXiv:2104.06700

    cs.LG cs.DC

    DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks

    Authors: Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj Kalamkar, Nesreen K. Ahmed, Sasikanth Avancha

    Abstract: Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible. It is challenging due to large memory capacity and bandwidth requirements on a single compute node and high communication volumes across multiple nodes. In this paper, we present DistGNN that optimizes the well-known Deep G…

    Submitted 16 April, 2021; v1 submitted 14 April, 2021; originally announced April 2021.

  4. Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning & HPC Workloads

    Authors: Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Deepti Aggarwal, Cristina Anderson, Alexander Breuer, Jeremy Bruestle, Narendra Chaudhary, Abhisek Kundu, Denise Kutnick, Frank Laub, Vasimuddin Md, Sanchit Misra, Ramanarayan Mohanty, Hans Pabst, Brian Retford, Barukh Ziv, Alexander Heinecke

    Abstract: During the past decade, novel Deep Learning (DL) algorithms, workloads and hardware have been developed to tackle a wide range of problems. Despite the advances in workload and hardware ecosystems, the programming methodology of DL systems is stagnant. DL workloads leverage either highly-optimized, yet platform-specific and inflexible kernels from DL libraries, or in the case of novel operators, r…

    Submitted 30 November, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

  5. arXiv:2005.04680

    cs.DC cs.IR cs.LG cs.PF

    Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures

    Authors: Dhiraj Kalamkar, Evangelos Georganas, Sudarshan Srinivasan, Jianping Chen, Mikhail Shiryaev, Alexander Heinecke

    Abstract: During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC system for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems which account for most of the AI cycles in cloud computing centers. More…

    Submitted 10 May, 2020; originally announced May 2020.

  6. arXiv:1909.07729

    cs.LG cs.NE stat.ML

    K-TanH: Efficient TanH For Deep Learning

    Authors: Abhisek Kundu, Alex Heinecke, Dhiraj Kalamkar, Sudarshan Srinivasan, Eric C. Qin, Naveen K. Mellempudi, Dipankar Das, Kunal Banerjee, Bharat Kaul, Pradeep Dubey

    Abstract: We propose K-TanH, a novel, highly accurate, hardware efficient approximation of popular activation function TanH for Deep Learning. K-TanH consists of parameterized low-precision integer operations, such as, shift and add/subtract (no floating point operation needed) where parameters are stored in very small look-up tables that can fit in CPU registers. K-TanH can work on various numerical format…

    Submitted 7 June, 2020; v1 submitted 17 September, 2019; originally announced September 2019.

    Comments: 6 pages, 1 figure

  7. arXiv:1906.06440

    cs.LG cs.DC stat.ML

    High-Performance Deep Learning via a Single Building Block

    Authors: Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat, Michael Anderson, Greg Henry, Hans Pabst, Alexander Heinecke

    Abstract: Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each workload/architecture, leading to numerous, complex code-bases that strive for performance, yet they are hard to maintain and do not generalize. In this work, we introduce the…

    Submitted 17 June, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

  8. arXiv:1905.12322

    cs.LG stat.ML

    A Study of BFLOAT16 for Deep Learning Training

    Authors: Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, Pradeep Dubey

    Abstract: This paper presents the first comprehensive empirical study demonstrating the efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep Learning training across image classification, speech recognition, language modeling, generative networks and industrial recommendation systems. BFLOAT16 is attractive for Deep Learning training for two reasons: the range of values it can repr…

    Submitted 13 June, 2019; v1 submitted 29 May, 2019; originally announced May 2019.

  9. arXiv:1808.05567

    cs.DC

    Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures

    Authors: Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, Alexander Heinecke

    Abstract: Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs) which provide state-of-the-art results for tasks like image recognition, neural machine translation and speech recognition. The computationally expensive nature of a convolution operation has led to the proliferation of implementations including matrix-matrix multiplication form…

    Submitted 20 August, 2018; v1 submitted 16 August, 2018; originally announced August 2018.

    Comments: Accepted to SC18

  10. arXiv:1802.00930

    cs.NE cs.LG math.NA

    Mixed Precision Training of Convolutional Neural Networks using Integer Operations

    Authors: Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, Vadim Pirogov

    Abstract: The state-of-the-art (SOTA) for mixed precision training is dominated by variants of low precision floating point operations, and in particular, FP16 accumulating into FP32 Micikevicius et al. (2017). On the other hand, while a lot of research has also happened in the domain of low and mixed-precision Integer training, these works either present results for non-SOTA networks (for instance only Ale…

    Submitted 23 February, 2018; v1 submitted 3 February, 2018; originally announced February 2018.

    Comments: Published as a conference paper at ICLR 2018

  11. arXiv:1801.08030

    cs.DC cs.LG

    On Scale-out Deep Learning Training for Cloud and HPC

    Authors: Srinivas Sridharan, Karthikeyan Vaidyanathan, Dhiraj Kalamkar, Dipankar Das, Mikhail E. Smorkalov, Mikhail Shiryaev, Dheevatsa Mudigere, Naveen Mellempudi, Sasikanth Avancha, Bharat Kaul, Pradeep Dubey

    Abstract: The exponential growth in use of large deep neural networks has accelerated the need for training these deep neural networks in hours or even minutes. This can only be achieved through scalable and efficient distributed training, since a single node/card cannot satisfy the compute, memory, and I/O requirements of today's state-of-the-art deep neural networks. However, scaling synchronous Stochasti…

    Submitted 24 January, 2018; originally announced January 2018.

    Comments: Accepted in SysML 2018 conference

  12. arXiv:1602.06709

    cs.DC cs.LG

    Distributed Deep Learning Using Synchronous Stochastic Gradient Descent

    Authors: Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, Pradeep Dubey

    Abstract: We design and implement a distributed multinode synchronous SGD algorithm, without altering hyper parameters, or compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling, and identify optimal design points for different networks. We demonstrate scaling of CNNs on 100s of nodes, and present what we believe to be record training throughputs. A 512 minibatch VGG-A…

    Submitted 22 February, 2016; originally announced February 2016.