-
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications
Authors:
Jongsoo Park,
Maxim Naumov,
Protonu Basu,
Summer Deng,
Aravind Kalaiah,
Daya Khudia,
James Law,
Parth Malani,
Andrey Malevich,
Satish Nadathur,
Juan Pino,
Martin Schatz,
Alexander Sidorov,
Viswanath Sivakumar,
Andrew Tulloch,
Xiaodong Wang,
Yiming Wu,
Hector Yuen,
Utku Diril,
Dmytro Dzhulgakov,
Kim Hazelwood,
Bill Jia,
Yangqing Jia,
Lin Qiao,
Vijay Rao
, et al. (3 additional authors not shown)
Abstract:
The application of deep learning techniques resulted in remarkable improvement of machine learning models. In this paper provides detailed characterizations of deep learning models used in many Facebook social network services. We present computational characteristics of our models, describe high performance optimizations targeting existing systems, point out their limitations and make suggestions…
▽ More
The application of deep learning techniques resulted in remarkable improvement of machine learning models. In this paper provides detailed characterizations of deep learning models used in many Facebook social network services. We present computational characteristics of our models, describe high performance optimizations targeting existing systems, point out their limitations and make suggestions for the future general-purpose/accelerated inference hardware. Also, we highlight the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers.
△ Less
Submitted 29 November, 2018; v1 submitted 24 November, 2018;
originally announced November 2018.
-
Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies
Authors:
Brian Friesen,
Md. Mostofa Ali Patwary,
Brian Austin,
Nadathur Satish,
Zachary Slepian,
Narayanan Sundaram,
Deborah Bard,
Daniel J Eisenstein,
Jack Deslippe,
Pradeep Dubey,
Prabhat
Abstract:
The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to bill…
▽ More
The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N^2) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8PF (peak) and 5.06PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.
△ Less
Submitted 31 August, 2017;
originally announced September 2017.
-
Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data
Authors:
Thorsten Kurth,
Jian Zhang,
Nadathur Satish,
Ioannis Mitliagkas,
Evan Racah,
Mostofa Ali Patwary,
Tareq Malas,
Narayanan Sundaram,
Wahid Bhimji,
Mikhail Smorkalov,
Jack Deslippe,
Mikhail Shiryaev,
Srinivas Sridharan,
Prabhat,
Pradeep Dubey
Abstract:
This paper presents the first, 15-PetaFLOP Deep Learning system for solving scientific pattern classification problems on contemporary HPC architectures. We develop supervised convolutional architectures for discriminating signals in high-energy physics data as well as semi-supervised architectures for localizing and classifying extreme weather in climate data. Our Intelcaffe-based implementation…
▽ More
This paper presents the first, 15-PetaFLOP Deep Learning system for solving scientific pattern classification problems on contemporary HPC architectures. We develop supervised convolutional architectures for discriminating signals in high-energy physics data as well as semi-supervised architectures for localizing and classifying extreme weather in climate data. Our Intelcaffe-based implementation obtains $\sim$2TFLOP/s on a single Cori Phase-II Xeon-Phi node. We use a hybrid strategy employing synchronous node-groups, while using asynchronous communication across groups. We use this strategy to scale training of a single model to $\sim$9600 Xeon-Phi nodes; obtaining peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s. At scale, our HEP architecture produces state-of-the-art classification accuracy on a dataset with 10M images, exceeding that achieved by selections on high-level physics-motivated features. Our semi-supervised architecture successfully extracts weather patterns in a 15TB climate dataset. Our results demonstrate that Deep Learning can be optimized and scaled effectively on many-core, HPC systems.
△ Less
Submitted 17 August, 2017;
originally announced August 2017.
-
Banshee: Bandwidth-Efficient DRAM Caching Via Software/Hardware Cooperation
Authors:
Xiangyao Yu,
Christopher J. Hughes,
Nadathur Satish,
Onur Mutlu,
Srinivas Devadas
Abstract:
Putting the DRAM on the same package with a processor enables several times higher memory bandwidth than conventional off-package DRAM. Yet, the latency of in-package DRAM is not appreciably lower than that of off-package DRAM. A promising use of in-package DRAM is as a large cache. Unfortunately, most previous DRAM cache designs mainly optimize for hit latency and do not consider off-chip bandwid…
▽ More
Putting the DRAM on the same package with a processor enables several times higher memory bandwidth than conventional off-package DRAM. Yet, the latency of in-package DRAM is not appreciably lower than that of off-package DRAM. A promising use of in-package DRAM is as a large cache. Unfortunately, most previous DRAM cache designs mainly optimize for hit latency and do not consider off-chip bandwidth efficiency as a first-class design constraint. Hence, as we show in this paper, these designs are suboptimal for use with in-package DRAM.
We propose a new DRAM cache design, Banshee, that optimizes for both in- and off-package DRAM bandwidth efficiency without degrading access latency. The key ideas are to eliminate the in-package DRAM bandwidth overheads due to costly tag accesses through virtual memory mechanism and to incorporate a bandwidth-aware frequency-based replacement policy that is biased to reduce unnecessary traffic to off-package DRAM. Our extensive evaluation shows that Banshee provides significant performance improvement and traffic reduction over state-of-the-art latency-optimized DRAM cache designs.
△ Less
Submitted 9 April, 2017;
originally announced April 2017.
-
Parallelizing Word2Vec in Multi-Core and Many-Core Architectures
Authors:
Shihao Ji,
Nadathur Satish,
Sheng Li,
Pradeep Dubey
Abstract:
Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words. State-of-the-art algorithms including those by Mikolov et al. have been parallelized for multi-core CPU architectures, but are based on vector-vector operations with "Hogwild" updates that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we propose "H…
▽ More
Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words. State-of-the-art algorithms including those by Mikolov et al. have been parallelized for multi-core CPU architectures, but are based on vector-vector operations with "Hogwild" updates that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we propose "HogBatch" by improving reuse of various data structures in the algorithm through the use of minibatching and negative sample sharing, hence allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute word2vec computation across nodes in a compute cluster, and demonstrate good strong scalability up to 32 nodes. The new algorithm is particularly suitable for modern multi-core/many-core architectures, especially Intel's latest Knights Landing processors, and allows us to scale up the computation near linearly across cores and nodes, and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge.
△ Less
Submitted 23 December, 2016; v1 submitted 18 November, 2016;
originally announced November 2016.
-
PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures
Authors:
Md. Mostofa Ali Patwary,
Nadathur Rajagopalan Satish,
Narayanan Sundaram,
Jialin Liu,
Peter Sadowski,
Evan Racah,
Suren Byna,
Craig Tull,
Wahid Bhimji,
Prabhat,
Pradeep Dubey
Abstract:
Computing $k$-Nearest Neighbors (KNN) is one of the core kernels used in many machine learning, data mining and scientific computing applications. Although kd-tree based $O(\log n)$ algorithms have been proposed for computing KNN, due to its inherent sequentiality, linear algorithms are being used in practice. This limits the applicability of such methods to millions of data points, with limited s…
▽ More
Computing $k$-Nearest Neighbors (KNN) is one of the core kernels used in many machine learning, data mining and scientific computing applications. Although kd-tree based $O(\log n)$ algorithms have been proposed for computing KNN, due to its inherent sequentiality, linear algorithms are being used in practice. This limits the applicability of such methods to millions of data points, with limited scalability for Big Data analytics challenges in the scientific domain. In this paper, we present parallel and highly optimized kd-tree based KNN algorithms (both construction and querying) suitable for distributed architectures. Our algorithm includes novel approaches for pruning search space and improving load balancing and partitioning among nodes and threads. Using TB-sized datasets from three science applications: astrophysics, plasma physics, and particle physics, we show that our implementation can construct kd-tree of 189 billion particles in 48 seconds on utilizing $\sim$50,000 cores. We also demonstrate computation of KNN of 19 billion queries in 12 seconds. We demonstrate almost linear speedup both for shared and distributed memory computers. Our algorithms outperforms earlier implementations by more than order of magnitude; thereby radically improving the applicability of our implementation to state-of-the-art Big Data analytics problems. In addition, we showcase performance and scalability on the recently released Intel Xeon Phi processor showing that our algorithm scales well even on massively parallel architectures.
△ Less
Submitted 27 July, 2016;
originally announced July 2016.
-
Parallelizing Word2Vec in Shared and Distributed Memory
Authors:
Shihao Ji,
Nadathur Satish,
Sheng Li,
Pradeep Dubey
Abstract:
Word2Vec is a widely used algorithm for extracting low-dimensional vector representations of words. It generated considerable excitement in the machine learning and natural language processing (NLP) communities recently due to its exceptional performance in many NLP applications such as named entity recognition, sentiment analysis, machine translation and question answering. State-of-the-art algor…
▽ More
Word2Vec is a widely used algorithm for extracting low-dimensional vector representations of words. It generated considerable excitement in the machine learning and natural language processing (NLP) communities recently due to its exceptional performance in many NLP applications such as named entity recognition, sentiment analysis, machine translation and question answering. State-of-the-art algorithms including those by Mikolov et al. have been parallelized for multi-core CPU architectures but are based on vector-vector operations that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we improve reuse of various data structures in the algorithm through the use of minibatching, hence allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute word2vec computation across nodes in a compute cluster, and demonstrate good strong scalability up to 32 nodes. In combination, these techniques allow us to scale up the computation near linearly across cores and nodes, and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge.
△ Less
Submitted 8 August, 2016; v1 submitted 15 April, 2016;
originally announced April 2016.
-
BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies
Authors:
Shihao Ji,
S. V. N. Vishwanathan,
Nadathur Satish,
Michael J. Anderson,
Pradeep Dubey
Abstract:
We propose BlackOut, an approximation algorithm to efficiently train massive recurrent neural network language models (RNNLMs) with million word vocabularies. BlackOut is motivated by using a discriminative loss, and we describe a new sampling strategy which significantly reduces computation while improving stability, sample efficiency, and rate of convergence. One way to understand BlackOut is to…
▽ More
We propose BlackOut, an approximation algorithm to efficiently train massive recurrent neural network language models (RNNLMs) with million word vocabularies. BlackOut is motivated by using a discriminative loss, and we describe a new sampling strategy which significantly reduces computation while improving stability, sample efficiency, and rate of convergence. One way to understand BlackOut is to view it as an extension of the DropOut strategy to the output layer, wherein we use a discriminative training loss and a weighted sampling scheme. We also establish close connections between BlackOut, importance sampling, and noise contrastive estimation (NCE). Our experiments, on the recently released one billion word language modeling benchmark, demonstrate scalability and accuracy of BlackOut; we outperform the state-of-the art, and achieve the lowest perplexity scores on this dataset. Moreover, unlike other established methods which typically require GPUs or CPU clusters, we show that a carefully implemented version of BlackOut requires only 1-10 days on a single machine to train a RNNLM with a million word vocabulary and billions of parameters on one billion words. Although we describe BlackOut in the context of RNNLM training, it can be used to any networks with large softmax output layers.
△ Less
Submitted 31 March, 2016; v1 submitted 21 November, 2015;
originally announced November 2015.
-
GraphMat: High performance graph analytics made productive
Authors:
Narayanan Sundaram,
Nadathur Rajagopalan Satish,
Md Mostofa Ali Patwary,
Subramanya R Dulloor,
Satya Gautam Vadlamudi,
Dipankar Das,
Pradeep Dubey
Abstract:
Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising on productivity. GraphMat is our solution to bridge this gap between a user-friendly graph analytics framework and native, hand-optimized code. GraphMat functions by taking vertex programs and mapping them to high performance sparse matrix operat…
▽ More
Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising on productivity. GraphMat is our solution to bridge this gap between a user-friendly graph analytics framework and native, hand-optimized code. GraphMat functions by taking vertex programs and mapping them to high performance sparse matrix operations in the backend. We get the productivity benefits of a vertex programming framework without sacrificing performance. GraphMat is in C++, and we have been able to write a diverse set of graph algorithms in this framework with the same effort compared to other vertex programming frameworks. GraphMat performs 1.2-7X faster than high performance frameworks such as GraphLab, CombBLAS and Galois. It achieves better multicore scalability (13-15X on 24 cores) than other frameworks and is 1.2X off native, hand-optimized code on a variety of different graph algorithms. Since GraphMat performance depends mainly on a few scalable and well-understood sparse matrix operations, GraphMatcan naturally benefit from the trend of increasing parallelism on future hardware.
△ Less
Submitted 24 March, 2015;
originally announced March 2015.
-
Fast Updates on Read-Optimized Databases Using Multi-Core CPUs
Authors:
Jens Krueger,
Changkyu Kim,
Martin Grund,
Nadathur Satish,
David Schwalb,
Jatin Chhugani,
Hasso Plattner,
Pradeep Dubey,
Alexander Zeier
Abstract:
Read-optimized columnar databases use differential updates to handle writes by maintaining a separate write-optimized delta partition which is periodically merged with the read-optimized and compressed main partition. This merge process introduces significant overheads and unacceptable downtimes in update intensive systems, aspiring to combine transactional and analytical workloads into one system…
▽ More
Read-optimized columnar databases use differential updates to handle writes by maintaining a separate write-optimized delta partition which is periodically merged with the read-optimized and compressed main partition. This merge process introduces significant overheads and unacceptable downtimes in update intensive systems, aspiring to combine transactional and analytical workloads into one system. In the first part of the paper, we report data analyses of 12 SAP Business Suite customer systems. In the second half, we present an optimized merge process reducing the merge overhead of current systems by a factor of 30. Our linear-time merge algorithm exploits the underlying high compute and bandwidth resources of modern multi-core CPUs with architecture-aware optimizations and efficient parallelization. This enables compressed in-memory column stores to handle the transactional update rate required by enterprise applications, while keeping properties of read-optimized databases for analytic-style queries.
△ Less
Submitted 30 September, 2011;
originally announced September 2011.