Search | arXiv e-print repository

Palu: Compressing KV-Cache with Low-Rank Projection

Authors: Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu

Abstract: KV-Cache compression methods generally sample a KV-Cache of effectual tokens or quantize it into lower bits. However, these methods cannot exploit the redundancy of the hidden dimension of KV tensors. This paper investigates a unique hidden dimension approach called Palu, a novel KV-Cache compression framework that utilizes low-rank projection. Palu decomposes the linear layers into low-rank matrices, caches the smaller intermediate states, and reconstructs the full keys and values on the fly. To improve accuracy, compression rate, and efficiency, Palu further encompasses (1) a medium-grained low-rank decomposition scheme, (2) an efficient rank search algorithm, (3) a low-rank-aware quantization algorithm, and (4) matrix fusion with optimized GPU kernels. Our extensive experiments with popular LLMs show that Palu can compress KV-Cache by more than 91.25% while maintaining a significantly better accuracy (up to 1.19 lower perplexity) than state-of-the-art KV-Cache quantization methods at a similar or even higher memory usage. When compressing KV-Cache for 50%, Palu delivers up to 1.61x end-to-end speedup for the attention module. Our code is publicly available at https://github.com/shadowpa0327/Palu.

Submitted 30 July, 2024; originally announced July 2024.
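
The caching idea described in the Palu abstract (factor a projection into two low-rank matrices, cache the small intermediate state, rebuild keys on demand) can be sketched in a few lines. This is an illustration only, not the authors' implementation: the factors A and B, their sizes, and the helper matmul are hypothetical, and W is constructed to be exactly rank 2 so that reconstruction is exact. Decomposing a trained layer would in general only approximate it.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical factors: hidden size 4, rank 2, key dimension 4.
A = [[1, 0], [0, 1], [1, 1], [2, -1]]   # 4x2 down-projection
B = [[1, 2, 0, 1], [0, 1, 3, -1]]       # 2x4 reconstruction matrix
W = matmul(A, B)                        # 4x4 key projection with rank 2 (assumed)

tokens = [[1, 2, 3, 4], [0, 1, 0, 1]]   # two token hidden states

full_keys = matmul(tokens, W)           # what a standard KV-Cache would store
latent_cache = matmul(tokens, A)        # low-rank cache: 2 numbers per token
rebuilt_keys = matmul(latent_cache, B)  # keys reconstructed on the fly

floats_full = sum(len(row) for row in full_keys)
floats_latent = sum(len(row) for row in latent_cache)
```

In this toy setup the cache holds half as many numbers per token (rank 2 versus key dimension 4), at the cost of one extra small matrix product whenever keys are needed.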

arXiv:2406.06542 [pdf, other]

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

Authors: Size Zheng, Renze Chen, Meng Li, Zihao Ye, Luis Ceze, Yun Liang

Abstract: IoT devices based on microcontroller units (MCU) provide ultra-low power consumption and ubiquitous computation for near-sensor deep learning models (DNN). However, the memory of MCU is usually 2-3 orders of magnitude smaller than mobile devices, which makes it challenging to map DNNs onto MCUs. Previous work separates memory management and kernel implementation for MCU and relies on coarse-grained memory management techniques such as inplace update to reduce memory consumption. In this paper, we propose to coordinate memory management and kernel optimization for DNN inference on MCUs to enable fine-grained memory management. The key idea is to virtualize the limited memory of MCU as a large memory pool. Each kernel divides the memory pool into kernel-specific segments and handles segment load and store while computing DNN layers. Memory consumption can be reduced because, using the fine-grained segment-level memory control, we can overlap the memory footprint of different tensors without the need to materialize them at the same time. Following this idea, we implement vMCU for DNN inference on MCU. Evaluation for single layers on ARM Cortex-M4 and Cortex-M7 processors shows that vMCU can reduce RAM usage by 12.0% to 49.5% and energy consumption by 20.6% to 53.0% compared to state-of-the-art work. For full DNN evaluation, vMCU can reduce the memory bottleneck by 61.5%, enabling more models to be deployed on low-end MCUs.

Submitted 1 May, 2024; originally announced June 2024.

arXiv:2310.19102 [pdf, other]

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Authors: Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

Abstract: The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization in the serving context. Atom improves end-to-end throughput (token/s) by up to 7.7x compared to the FP16 and by 2.5x compared to INT8 quantization, while maintaining the same latency target.

Submitted 16 April, 2024; v1 submitted 29 October, 2023; originally announced October 2023.
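
To make "fine-grained" low-bit quantization concrete, here is a generic sketch of symmetric group-wise 4-bit quantization, in which every small group of values gets its own scale. It is not Atom's algorithm (the abstract does not specify it, and Atom's mixed-precision handling is not modeled). The function names, the group size of 4, and the sample values are all assumptions made for illustration.

```python
def quantize_group(values, bits=4):
    """Symmetric quantization of one group to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    # One scale per group; fall back to 1.0 for an all-zero group.
    scale = max(abs(v) for v in values) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def quantize_groupwise(values, group_size=4, bits=4):
    """Split `values` into consecutive groups and quantize each independently."""
    return [quantize_group(values[i:i + group_size], bits)
            for i in range(0, len(values), group_size)]

def dequantize(groups):
    """Map integer codes back to approximate real values."""
    return [code * scale for codes, scale in groups for code in codes]

# Hypothetical activations: a small-magnitude group followed by a large one.
activations = [0.1, -0.7, 0.35, 0.0, 10.0, -5.0, 2.5, 7.0]

groups = quantize_groupwise(activations)
restored = dequantize(groups)

# For comparison: one scale for the whole tensor.
per_tensor_codes, per_tensor_scale = quantize_group(activations)
```

With a single per-tensor scale, the large value 10.0 dominates and the first four entries all quantize to code 0. With per-group scales, the small group keeps its own resolution, and each restored value is within half a quantization step of the original.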

arXiv:2310.18547 [pdf, other]

Punica: Multi-Tenant LoRA Serving

Authors: Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy

Abstract: Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token. Punica is open source at https://github.com/punica-ai/punica.

Submitted 27 October, 2023; originally announced October 2023.

arXiv:2207.04606 [pdf, other]

SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning

Authors: Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, Luis Ceze

Abstract: Sparse tensors are rapidly becoming critical components of modern deep learning workloads. However, developing high-performance sparse operators can be difficult and tedious, and existing vendor libraries cannot satisfy the escalating demands from new operators. Sparse tensor compilers simplify the development of operators, but efficient sparse compilation for deep learning remains challenging because a single sparse format cannot maximize hardware efficiency, and single-shot compilers cannot keep up with latest hardware and system advances. In this paper, we observe that the key to addressing both these challenges is to leverage composable formats and composable transformations. We propose SparseTIR, a sparse tensor compilation abstraction that offers composable formats and composable transformations for deep learning workloads. SparseTIR constructs a search space over these composable components for performance tuning. With these improvements, SparseTIR obtains consistent performance speedups vs vendor libraries on GPUs for single operators: 1.20-2.34x for GNN operators, 1.05-2.98x for sparse attention operators, and 0.56-7.45x for sparse convolution operators. SparseTIR also accelerates end-to-end GNNs by 1.08-1.52x for GraphSAGE training, and 4.20-40.18x for RGCN inference.

Submitted 21 February, 2023; v1 submitted 10 July, 2022; originally announced July 2022.

Comments: To appear at ASPLOS 2023 (19 pages, 23 figures), source code available at https://github.com/uwsampl/sparsetir, artifact available at https://github.com/uwsampl/sparsetir-artifact

arXiv:2110.14819 [pdf, other]

Characterizing and Taming Resolution in Convolutional Neural Networks

Authors: Eddie Yan, Liang Luo, Luis Ceze

Abstract: Image resolution has a significant effect on the accuracy and computational, storage, and bandwidth costs of computer vision model inference. These costs are exacerbated when scaling out models to large inference serving systems and make image resolution an attractive target for optimization. However, the choice of resolution inherently introduces additional tightly coupled choices, such as image crop size, image detail, and compute kernel implementation that impact computational, storage, and bandwidth costs. Further complicating this setting, the optimal choices from the perspective of these metrics are highly dependent on the dataset and problem scenario. We characterize this tradeoff space, quantitatively studying the accuracy and efficiency tradeoff via systematic and automated tuning of image resolution, image quality and convolutional neural network operators. With the insights from this study, we propose a dynamic resolution mechanism that removes the need to statically choose a resolution ahead of time.

Submitted 27 October, 2021; originally announced October 2021.

arXiv:2105.14088 [pdf, other]

Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with Rank Reordering

Authors: Liang Luo, Jacob Nelson, Arvind Krishnamurthy, Luis Ceze

Abstract: ML workloads are becoming increasingly popular in the cloud. Good cloud training performance is contingent on efficient parameter exchange among VMs. We find that Collectives, the widely used distributed communication algorithms, cannot perform optimally out of the box due to the hierarchical topology of datacenter networks and multi-tenancy nature of the cloud environment. In this paper, we present Cloud Collectives, a prototype that accelerates collectives by reordering the ranks of participating VMs such that the communication pattern dictated by the selected collectives operation best exploits the locality in the network. Collectives is non-intrusive, requires no code changes nor rebuild of an existing application, and runs without support from cloud providers. Our preliminary application of Cloud Collectives on allreduce operations in public clouds results in a speedup of up to 3.7x in multiple microbenchmarks and 1.3x in real-world workloads of distributed training of deep neural networks and gradient boosted decision trees using state-of-the-art frameworks.

Submitted 28 May, 2021; originally announced May 2021.

arXiv:2105.09377 [pdf, other]

doi 10.1145/3460945.3464953

Pure Tensor Program Rewriting via Access Patterns (Representation Pearl)

Authors: Gus Henry Smith, Andrew Liu, Steven Lyubomirsky, Scott Davidson, Joseph McMahan, Michael Taylor, Luis Ceze, Zachary Tatlock

Abstract: Tensor kernels in machine learning (ML) often correspond to pure mathematical expressions, making term rewriting an attractive strategy for optimization and mapping to specialized hardware accelerators. However, existing ML intermediate representations (IRs) tend to either be pure but high-level, making low-level rewrites to hardware targets inexpressible, or low-level but impure, hampering the use of term rewriting altogether. This paper introduces Glenside, a pure IR whose core abstraction -- the access pattern -- enables low-level, layout-aware, hardware-centric program rewrites. We demonstrate how term rewriting in Glenside can be used to map program fragments to hardware accelerator invocations and automatically discover classic data layout transformations like im2col. Glenside establishes a new foundation for exploring further term rewriting techniques in optimizing low-level tensor programs.

Submitted 19 May, 2021; originally announced May 2021.

Comments: To be published at MAPS 2021

arXiv:2104.10716 [pdf, other]

Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks

Authors: Chien-Yu Lin, Liang Luo, Luis Ceze

Abstract: Graph neural networks (GNNs), an emerging deep learning model class, can extract meaningful representations from highly expressive graph-structured data and are therefore gaining popularity for wider ranges of applications. However, current GNNs suffer from the poor performance of their sparse-dense matrix multiplication (SpMM) operator, even when using powerful GPUs. Our analysis shows that 95% of the inference time could be spent on SpMM when running popular GNN models on NVIDIA's advanced V100 GPU. Such SpMM performance bottleneck hinders GNNs' applicability to large-scale problems or the development of more sophisticated GNN models. To address this inference time bottleneck, we introduce ES-SpMM, a cache-first edge sampling mechanism and codesigned SpMM kernel. ES-SpMM uses edge sampling to downsize the graph to fit into GPU's shared memory. It thus reduces the computation cost and improves SpMM's cache locality. To evaluate ES-SpMM's performance, we integrated it with a popular GNN framework, DGL, and tested it using representative GNN models and datasets. Our results show that ES-SpMM outperforms the highly optimized cuSPARSE SpMM kernel by up to 4.35x with no accuracy loss and by 45.3x with less than a 1% accuracy loss.

Submitted 23 April, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

arXiv:2103.16604 [pdf, other]

VSS: A Storage System for Video Analytics [Technical Report]

Authors: Brandon Haynes, Maureen Daum, Dong He, Amrita Mazumdar, Magdalena Balazinska, Alvin Cheung, Luis Ceze

Abstract: We present a new video storage system (VSS) designed to decouple high-level video operations from the low-level details required to store and efficiently retrieve video data. VSS is designed to be the storage subsystem of a video data management system (VDBMS) and is responsible for: (1) transparently and automatically arranging the data on disk in an efficient, granular format; (2) caching frequently-retrieved regions in the most useful formats; and (3) eliminating redundancies found in videos captured from multiple cameras with overlapping fields of view. Our results suggest that VSS can improve VDBMS read performance by up to 54%, reduce storage costs by up to 45%, and enable developers to focus on application logic rather than video storage and retrieval.

Submitted 30 March, 2021; originally announced March 2021.

arXiv:2103.14949 [pdf, other]

Automated Backend-Aware Post-Training Quantization

Authors: Ziheng Jiang, Animesh Jain, Andrew Liu, Josh Fromm, Chengqian Ma, Tianqi Chen, Luis Ceze

Abstract: Quantization is a key technique to reduce the resource requirement and improve the performance of neural network deployment. However, different hardware backends such as x86 CPU, NVIDIA GPU, ARM CPU, and accelerators may demand different implementations for quantized networks. This diversity calls for specialized post-training quantization pipelines to be built for each hardware target, an engineering effort that is often too large for developers to keep up with. We tackle this problem with an automated post-training quantization framework called HAGO. HAGO provides a set of general quantization graph transformations based on a user-defined hardware specification and implements a search mechanism to find the optimal quantization strategy while satisfying hardware constraints for any model. We observe that HAGO achieves speedups of 2.09x, 1.97x, and 2.48x on Intel Xeon Cascade Lake CPUs, NVIDIA Tesla T4 GPUs, and ARM Cortex-A CPUs on Raspberry Pi4, respectively, relative to full precision, while maintaining the highest reported post-training quantization accuracy in each case.

Submitted 27 March, 2021; originally announced March 2021.

arXiv:2011.14243 [pdf, other]

Srifty: Swift and Thrifty Distributed Training on the Cloud

Authors: Liang Luo, Peter West, Arvind Krishnamurthy, Luis Ceze

Abstract: Finding the best VM configuration is key to achieve lower cost and higher throughput, two primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM selection that meets user constraints requires efficiently navigating a large search space while controlling for the performance variance associated with sharing cloud instances and networks. In this work, we characterize this variance in the context of distributed NN training and present results of a comprehensive throughput and cost-efficiency study we conducted across a wide array of instances to prune for the optimal VM search space. Using insights from these studies, we built Srifty, a system that combines runtime profiling with learned performance models to accurately predict training performance and find the best VM choice that satisfies user constraints, potentially leveraging both heterogeneous setups and spot instances. We integrated Srifty with PyTorch and evaluated it on Amazon EC2. We conducted a large-scale generalization study of Srifty across more than 2K training setups on EC2. Our results show that Srifty achieves an iteration latency prediction error of 8%, and its VM instance recommendations offer significant throughput gain and cost reduction while satisfying user constraints compared to existing solutions in complex, real-world scenarios.

Submitted 1 July, 2022; v1 submitted 28 November, 2020; originally announced November 2020.

arXiv:2003.00290 [pdf, other]

Enumerating Hardware-Software Splits with Program Rewriting

Authors: Gus Smith, Zachary Tatlock, Luis Ceze

Abstract: A core problem in hardware-software codesign is in the sheer size of the design space. Without a set ISA to constrain the hardware-software interface, the design space explodes. This work presents a strategy for managing the massive hardware-software design space within the domain of machine learning inference workloads and accelerators. We first propose EngineIR, a new language for representing machine learning hardware and software in a single program. Then, using equality graphs -- a data structure from the compilers literature -- we suggest a method for efficiently enumerating the design space by performing rewrites over our representation.

Submitted 29 February, 2020; originally announced March 2020.

Comments: Accepted in the Second Young Architect Workshop, in conjunction with ASPLOS 2020

arXiv:1902.05971 [pdf, other]

Synthesizing Number Generators for Stochastic Computing using Mixed Integer Programming

Authors: Vincent T. Lee, Samuel Archibald Elliot, Armin Alaghi, Luis Ceze

Abstract: Stochastic computing (SC) is a high density, low-power computation technique which encodes values as unary bitstreams instead of binary-encoded (BE) values. Practical SC implementations require deterministic or pseudo-random number sequences which are optimally correlated to generate bitstreams and achieve accurate results. Unfortunately, the size of the search space makes manually designing optimally correlated number sequences a difficult task. To automate this design burden, we propose a synthesis formulation using mixed integer programming to automatically generate optimally correlated number sequences. In particular, our synthesis formulation improves the accuracy of arithmetic operations such as multiplication and squaring circuits by up to 2.5x and 20x respectively. We also show how our technique can be extended to scale to larger circuits.

Submitted 26 February, 2019; v1 submitted 15 February, 2019; originally announced February 2019.

Comments: 6 pages, 5 figures, 3 tables
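
The role of number sequences in stochastic computing can be seen in a toy example. A value is encoded as the fraction of 1s in a bitstream, produced by comparing the value against a number sequence, and multiplication is a bitwise AND of two streams. The sketch below uses a hand-picked pair of deterministic sequences for which the AND is exact. It does not reproduce the paper's mixed integer programming formulation, and the names and sizes are assumptions.

```python
def bitstream(value, sequence, levels):
    """Emit 1 whenever the sequence element is below the scaled value."""
    threshold = round(value * levels)
    return [1 if s < threshold else 0 for s in sequence]

def decode(bits):
    """A unary bitstream represents the fraction of 1s it contains."""
    return sum(bits) / len(bits)

LEVELS = 4                   # values are multiples of 1/4 (assumed resolution)
LENGTH = LEVELS * LEVELS     # 16-bit streams

# Two deterministic sequences that pair every level of one with every level
# of the other exactly once over the stream length.
seq_fast = [i % LEVELS for i in range(LENGTH)]
seq_slow = [i // LEVELS for i in range(LENGTH)]

a_bits = bitstream(0.5, seq_fast, LEVELS)
b_bits = bitstream(0.75, seq_slow, LEVELS)
product = decode([x & y for x, y in zip(a_bits, b_bits)])

# Poorly chosen sequences: both streams driven by the same sequence.
b_same = bitstream(0.75, seq_fast, LEVELS)
correlated = decode([x & y for x, y in zip(a_bits, b_same)])
```

With the paired sequences, `product` is 0.375, which equals 0.5 * 0.75. When both streams share one sequence, the AND yields 0.5, the minimum of the two inputs instead of their product, which is why the choice of sequences determines accuracy.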

arXiv:1902.01372 [pdf, other]

Vignette: Perceptual Compression for Video Storage and Processing Systems

Authors: Amrita Mazumdar, Brandon Haynes, Magdalena Balazinska, Luis Ceze, Alvin Cheung, Mark Oskin

Abstract: Compressed videos constitute 70% of Internet traffic, and video upload growth rates far outpace compute and storage improvement trends. Past work in leveraging perceptual cues like saliency, i.e., regions where viewers focus their perceptual attention, reduces compressed video size while maintaining perceptual quality, but requires significant changes to video codecs and ignores the data management of this perceptual information. In this paper, we propose Vignette, a compression technique and storage manager for perception-based video compression. Vignette complements off-the-shelf compression software and hardware codec implementations. Vignette's compression technique uses a neural network to predict saliency information used during transcoding, and its storage manager integrates perceptual information into the video storage system to support a perceptual compression feedback loop. Vignette's saliency-based optimizations reduce storage by up to 95% with minimal quality loss, and Vignette videos lead to power savings of 50% on mobile phones during video playback. Our results demonstrate the benefit of embedding information about the human visual system into the architecture of video storage systems.

Submitted 4 February, 2019; originally announced February 2019.

arXiv:1810.11066 [pdf, other]

Automating Generation of Low Precision Deep Learning Operators

Authors: Meghan Cowan, Thierry Moreau, Tianqi Chen, Luis Ceze

Abstract: State of the art deep learning models have made steady progress in the fields of computer vision and natural language processing, at the expense of growing model sizes and computational complexity. Deploying these models on low power and mobile devices poses a challenge due to their limited compute capabilities and strict energy budgets. One solution that has generated significant research interest is deploying highly quantized models that operate on low precision inputs and weights less than eight bits, trading off accuracy for performance. These models have a significantly reduced memory footprint (up to 32x reduction) and can replace multiply-accumulates with bitwise operations during compute intensive convolution and fully connected layers. Most deep learning frameworks rely on highly engineered linear algebra libraries such as ATLAS or Intel's MKL to implement efficient deep learning operators. To date, none of the popular deep learning frameworks directly support low precision operators, partly due to a lack of optimized low precision libraries. In this paper we introduce a workflow to quickly generate high performance low precision deep learning operators for arbitrary precision that target multiple CPU architectures and include optimizations such as memory tiling and vectorization. We present an extensive case study on low power ARM Cortex-A53 CPU, and show how we can generate 1-bit, 2-bit convolutions with speedups up to 16x over an optimized 16-bit integer baseline and 2.3x better than handwritten implementations.

Submitted 25 October, 2018; originally announced October 2018.

Comments: 10 pages, 11 figures

arXiv:1810.04756 [pdf, other]

Stochastic Synthesis for Stochastic Computing

Authors: Vincent T. Lee, Armin Alaghi, Luis Ceze, Mark Oskin

Abstract: Stochastic computing (SC) is an emerging computing technique which offers higher computational density, and lower power over binary-encoded (BE) computation. Unlike BE computation, SC encodes values as probabilistic bitstreams which makes designing new circuits unintuitive. Existing techniques for synthesizing SC circuits are limited to specific classes of functions such as polynomial evaluation or constant scaling. In this paper, we propose using stochastic synthesis, which is originally a program synthesis technique, to automate the task of synthesizing new SC circuits. Our results show stochastic synthesis is more general than past techniques and can synthesize manually designed SC circuits as well as new ones such as an approximate square root unit.

Submitted 10 October, 2018; originally announced October 2018.

Comments: 7 pages, 4 figures, 3 tables

arXiv:1810.02895 [pdf, ps, other]

Computer Security Risks of Distant Relative Matching in Consumer Genetic Databases

Authors: Peter M. Ney, Luis Ceze, Tadayoshi Kohno

Abstract: Consumer genetic testing has become immensely popular in recent years and has led to the creation of large scale genetic databases containing millions of dense autosomal genotype profiles. One of the most used features offered by genetic databases is the ability to find distant relatives using a technique called relative matching (or DNA matching). Recently, novel uses of relative matching were discovered that combined matching results with genealogical information to solve criminal cold cases. New estimates suggest that relative matching, combined with simple demographic information, could be used to re-identify a significant percentage of US Caucasian individuals. In this work we attempt to systematize computer security and privacy risks from relative matching and describe new security problems that can occur if an attacker uploads manipulated or forged genetic profiles. For example, forged profiles can be used by criminals to misdirect investigations, con-artists to defraud victims, or political operatives to blackmail opponents. We discuss solutions to mitigate these threats, including existing proposals to use digital signatures, and encourage the consumer genetics community to consider the broader security implications of relative matching now that it is becoming so prominent.

Submitted 5 October, 2018; originally announced October 2018.

arXiv:1807.04188 [pdf, other]

A Hardware-Software Blueprint for Flexible Deep Learning Specialization

Authors: Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Abstract: Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operators, or numerical systems threaten the viability of specialized hardware accelerators. We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads. VTA achieves this flexibility via a parametrizable architecture, two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA that explicitly orchestrates concurrent compute and memory tasks and (2) a microcode-ISA which implements a wide variety of operators with single-cycle tensor-tensor operations. Next, we propose a runtime system equipped with a JIT compiler for flexible code-generation and heterogeneous execution that enables effective use of the VTA architecture. VTA is integrated and open-sourced into Apache TVM, a state-of-the-art deep learning compilation stack that provides flexibility for diverse models and divergent hardware backends. We propose a flow that performs design space exploration to generate a customized hardware architecture and software operator library that can be leveraged by mainstream learning frameworks. We demonstrate our approach by deploying optimized deep learning models used for object classification and style transfer on edge-class FPGAs.

Submitted 22 April, 2019; v1 submitted 11 July, 2018; originally announced July 2018.

Comments: 6 pages plus references, 8 figures

arXiv:1805.08166 [pdf, other]

Learning to Optimize Tensor Programs

Authors: Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Abstract: We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.

Submitted 8 January, 2019; v1 submitted 21 May, 2018; originally announced May 2018.

Comments: NeurIPS 2018

arXiv:1805.07891 [pdf, other]

doi 10.1145/3267809.3267840

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Authors: Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy

Abstract: Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to communication. This paper characterizes DDNN training to precisely pinpoint these bottlenecks. We found that timely training requires high performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation and communication resources. We therefore propose PHub, a high performance multi-tenant, rack-scale PS design. PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks. PHub provides a performance improvement of up to 2.7x compared to state-of-the-art distributed training techniques for cloud-based ImageNet workloads, with 25% better throughput per dollar.

Submitted 17 January, 2020; v1 submitted 21 May, 2018; originally announced May 2018.

arXiv:1803.04862 [pdf, other]

Correlation Manipulating Circuits for Stochastic Computing

Authors: Vincent T. Lee, Armin Alaghi, Luis Ceze

Abstract: Stochastic computing (SC) is an emerging computing technique that promises high density, low power, and error tolerant solutions. In SC, values are encoded as unary bitstreams and SC arithmetic circuits operate on one or more bitstreams. In many cases, the input bitstreams must be correlated or uncorrelated for SC arithmetic to produce accurate results. As a result, a key challenge for designing SC accelerators is manipulating the impact of correlation across SC operations. This paper presents and evaluates a set of novel correlation manipulating circuits to manage correlation in SC computation: a synchronizer, desynchronizer, and decorrelator. We then use these circuits to propose improved SC maximum, minimum, and saturating adder designs. Compared to existing correlation manipulation techniques, our circuits are more accurate and up to 3x more energy efficient. In the context of an image processing pipeline, these circuits can reduce the total energy consumption by up to 24%.

Submitted 1 March, 2018; originally announced March 2018.

Comments: 6 pages, 5 figures, 4 tables, Design, Automation and Test in Europe Conference and Exhibition (2018)

arXiv:1802.04799 [pdf, other]

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Authors: Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Abstract: There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.

Submitted 5 October, 2018; v1 submitted 12 February, 2018; originally announced February 2018.

Comments: Significantly improved version, add automated optimization

arXiv:1801.09805 [pdf, other]

Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

Authors: Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy

Abstract: Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training requires going distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift the training performance bottleneck from computation to communication. Our experiments show existing DNN training frameworks do not scale in a typical cloud environment due to insufficient bandwidth and inefficient parameter server software stacks. We propose PBox, a balanced, scalable central PS hardware that balances compute and communication resources, and PHub, a high performance parameter server (PS) software design that provides an optimized network stack and a streamlined gradient processing pipeline to benefit common PS setups to utilize PBox. We show that in a typical cloud environment, PBox can achieve up to 3.8x speedup over state-of-the-art designs when training ImageNet. We discuss future directions of integrating PBox with programmable switches for in-network aggregation during training, leveraging the datacenter network topology to reduce bandwidth usage and localize data movement.

Submitted 17 January, 2020; v1 submitted 29 January, 2018; originally announced January 2018.

Journal ref: SysML 2018

arXiv:1706.08597 [pdf]

Democratizing Design for Future Computing Platforms

Authors: Luis Ceze, Mark D. Hill, Karthikeyan Sankaralingam, Thomas F. Wenisch

Abstract: Information and communications technology can continue to change our world. These advances will partially depend upon designs that synergistically combine software with specialized hardware. Today open-source software incubates rapid software-only innovation. The government can unleash software-hardware innovation with programs to develop open hardware components, tools, and design flows that simplify and reduce the cost of hardware design. Such programs will speed development for startup companies, established industry leaders, education, scientific research, and for government intelligence and defense platforms.

Submitted 26 June, 2017; originally announced June 2017.

Comments: A Computing Community Consortium (CCC) white paper, 4 pages

arXiv:1706.04332 [pdf, other]

MATIC: Learning Around Errors for Efficient Low-Voltage Neural Network Accelerators

Authors: Sung Kim, Patrick Howe, Thierry Moreau, Armin Alaghi, Luis Ceze, Visvesh Sathe

Abstract: As a result of the increasing demand for deep neural network (DNN)-based services, efforts to develop dedicated hardware accelerators for DNNs are growing rapidly. However, while accelerators with high performance and efficiency on convolutional deep neural networks (Conv-DNNs) have been developed, less progress has been made with regard to fully-connected DNNs (FC-DNNs). In this paper, we propose MATIC (Memory Adaptive Training with In-situ Canaries), a methodology that enables aggressive voltage scaling of accelerator weight memories to improve the energy-efficiency of DNN accelerators. To enable accurate operation with voltage overscaling, MATIC combines the characteristics of destructive SRAM reads with the error resilience of neural networks in a memory-adaptive training process. Furthermore, PVT-related voltage margins are eliminated using bit-cells from synaptic weights as in-situ canaries to track runtime environmental variation. Demonstrated on a low-power DNN accelerator that we fabricate in 65 nm CMOS, MATIC enables up to 60-80 mV of voltage overscaling (3.3x total energy reduction versus the nominal voltage), or 18.6x application error reduction.

Submitted 23 March, 2018; v1 submitted 14 June, 2017; originally announced June 2017.

Comments: 6 pages, 12 figures, 3 tables. Published at Design, Automation and Test in Europe Conference and Exhibition (DATE) 2018

arXiv:1706.03864 [pdf, other]

Exploring Computation-Communication Tradeoffs in Camera Systems

Authors: Amrita Mazumdar, Thierry Moreau, Sung Kim, Meghan Cowan, Armin Alaghi, Luis Ceze, Mark Oskin, Visvesh Sathe

Abstract: Cameras are the de facto sensor. The growing demand for real-time and low-power computer vision, coupled with trends towards high-efficiency heterogeneous systems, has given rise to a wide range of image processing acceleration techniques at the camera node and in the cloud. In this paper, we characterize two novel camera systems that use acceleration techniques to push the extremes of energy and performance scaling, and explore the computation-communication tradeoffs in their design. The first case study targets a camera system designed to detect and authenticate individual faces, running solely on energy harvested from RFID readers. We design a multi-accelerator SoC design operating in the sub-mW range, and evaluate it with real-world workloads to show performance and energy efficiency improvements over a general purpose microprocessor. The second camera system supports a 16-camera rig processing over 32 Gb/s of data to produce real-time 3D-360 degree virtual reality video. We design a multi-FPGA processing pipeline that outperforms CPU and GPU configurations by up to 10x in computation time, producing panoramic stereo video directly from the camera rig at 30 frames per second. We find that an early data reduction step, either before complex processing or offloading, is the most critical optimization for in-camera systems.

Submitted 16 October, 2017; v1 submitted 12 June, 2017; originally announced June 2017.

Journal ref: 2017 IEEE International Symposium on Workload Characterization (IISWC)

arXiv:1706.02344 [pdf]

Energy-Efficient Hybrid Stochastic-Binary Neural Networks for Near-Sensor Computing

Authors: Vincent T. Lee, Armin Alaghi, John P. Hayes, Visvesh Sathe, Luis Ceze

Abstract: Recent advances in neural networks (NNs) exhibit unprecedented success at transforming large, unstructured data streams into compact higher-level semantic information for tasks such as handwriting recognition, image classification, and speech recognition. Ideally, systems would employ near-sensor computation to execute these tasks at sensor endpoints to maximize data reduction and minimize data movement. However, near-sensor computing presents its own set of challenges such as operating power constraints, energy budgets, and communication bandwidth capacities. In this paper, we propose a stochastic-binary hybrid design which splits the computation between the stochastic and binary domains for near-sensor NN applications. In addition, our design uses a new stochastic adder and multiplier that are significantly more accurate than existing adders and multipliers. We also show that retraining the binary portion of the NN computation can compensate for precision losses introduced by shorter stochastic bit-streams, allowing faster run times at minimal accuracy losses. Our evaluation shows that our hybrid stochastic-binary design can achieve 9.8x energy efficiency savings, and application-level accuracies within 0.05% compared to conventional all-binary designs.

Submitted 7 June, 2017; originally announced June 2017.

Comments: 6 pages, 3 figures, Design, Automation and Test in Europe (DATE) 2017

arXiv:1704.05112 [pdf, ps, other]

Making data center computations fast, but not so furious

Authors: Daniel Porto, João Loff, Rui Duarte, Luis Ceze, Rodrigo Rodrigues

Abstract: We propose an aggressive computational sprinting variant for data center environments. While most previous work on computational sprinting focuses on maximizing the sprinting process while ensuring non-faulty conditions, we take advantage of the existing replication in data centers to push the system beyond its safety limits. In this paper we outline this vision, we survey existing techniques for achieving it, and we present some design ideas for future work in this area.

Submitted 17 April, 2017; originally announced April 2017.

Comments: The 7th Workshop on Multi-core and Rack Scale Systems - MARS'17

arXiv:1612.03182 [pdf]

Arch2030: A Vision of Computer Architecture Research over the Next 15 Years

Authors: Luis Ceze, Mark D. Hill, Thomas F. Wenisch

Abstract: Application trends, device technologies and the architecture of systems drive progress in information technologies. However, the former engines of such progress - Moore's Law and Dennard Scaling - are rapidly reaching the point of diminishing returns. The time has come for the computing community to boldly confront a new challenge: how to secure a foundational future for information technology's continued progress. The computer architecture community engaged in several visioning exercises over the years. Five years ago, we released a white paper, 21st Century Computer Architecture, which influenced funding programs in both academia and industry. More recently, the IEEE Rebooting Computing Initiative explored the future of computing systems in the architecture, device, and circuit domains. This report stems from an effort to continue this dialogue, reach out to the applications and devices/circuits communities, and understand their trends and vision. We aim to identify opportunities where architecture research can bridge the gap between the application and device domains.

Submitted 9 December, 2016; originally announced December 2016.

Comments: A Computing Community Consortium (CCC) white paper, 7 pages

arXiv:1609.06756 [pdf]

21st Century Computer Architecture

Authors: Mark D. Hill, Sarita Adve, Luis Ceze, Mary Jane Irwin, David Kaeli, Margaret Martonosi, Josep Torrellas, Thomas F. Wenisch, David Wood, Katherine Yelick

Abstract: Because most technology and computer architecture innovations were (intentionally) invisible to higher layers, application and other software developers could reap the benefits of this progress without engaging in it. Higher performance has both made more computationally demanding applications feasible (e.g., virtual assistants, computer vision) and made less demanding applications easier to develop by enabling higher-level programming abstractions (e.g., scripting languages and reusable components). Improvements in computer system cost-effectiveness enabled value creation that could never have been imagined by the field's founders (e.g., distributed web search sufficiently inexpensive so as to be covered by advertising links). The wide benefits of computer performance growth are clear. Recently, Danowitz et al. apportioned computer performance growth roughly equally between technology and architecture, with architecture credited with ~80x improvement since 1985. As semiconductor technology approaches its "end-of-the-road" (see below), computer architecture will need to play an increasing role in enabling future ICT innovation. But instead of asking, "How can I make my chip run faster?," architects must now ask, "How can I enable the 21st century infrastructure, from sensors to clouds, adding value from performance to privacy, but without the benefit of near-perfect technology scaling?". The challenges are many, but with appropriate investment, opportunities abound. Underlying these opportunities is a common theme that future architecture innovations will require the engagement of and investments from innovators in other ICT layers.

Submitted 21 September, 2016; originally announced September 2016.

Comments: A Computing Community Consortium (CCC) white paper, 16 pages

arXiv:1608.03175 [pdf, other]

Similarity Search on Automata Processors

Authors: Vincent T. Lee, Justin Kotalik, Carlo C. Del Mundo, Armin Alaghi, Luis Ceze, Mark Oskin

Abstract: Similarity search is a critical primitive for a wide variety of applications including natural language processing, content-based search, machine learning, computer vision, databases, robotics, and recommendation systems. At its core, similarity search is implemented using the k-nearest neighbors (kNN) algorithm, where computation consists of highly parallel distance calculations and a global top-k sort. In contemporary von-Neumann architectures, kNN is bottlenecked by data movement which limits throughput and latency. In this paper, we present and evaluate a novel automata-based algorithm for kNN on the Micron Automata Processor (AP), which is a non-von Neumann near-data processing architecture. By employing near-data processing, the AP minimizes the data movement bottleneck and is able to achieve better performance. Unlike prior work in the automata processing space, our work combines temporal encodings with automata design to augment the space of applications for the AP. We evaluate our design's performance on the AP and compare to state-of-the-art CPU, GPU, and FPGA implementations; we show that the current generation of AP hardware can achieve over 50x speedup over CPUs while maintaining competitive energy efficiency gains. We also propose several automata optimization techniques and simple architectural extensions that highlight the potential of the AP hardware.

Submitted 7 June, 2017; v1 submitted 9 August, 2016; originally announced August 2016.

Comments: 12 pages, 11 figures, accepted to International Parallel and Distributed Processing Symposium (IPDPS) 2017

arXiv:1606.03742 [pdf, other]

Application-Driven Near-Data Processing for Similarity Search

Authors: Vincent T. Lee, Amrita Mazumdar, Carlo C. del Mundo, Armin Alaghi, Luis Ceze, Mark Oskin

Abstract: Similarity search is a key to a variety of applications including content-based search for images and video, recommendation systems, data deduplication, natural language processing, computer vision, databases, computational biology, and computer graphics. At its core, similarity search manifests as k-nearest neighbors (kNN), a computationally simple primitive consisting of highly parallel distance calculations and a global top-k sort. However, kNN is poorly supported by today's architectures because of its high memory bandwidth requirements. This paper proposes an application-driven near-data processing accelerator for similarity search: the Similarity Search Associative Memory (SSAM). By instantiating compute units close to memory, SSAM benefits from the higher memory bandwidth and density exposed by emerging memory technologies. We evaluate the SSAM design down to layout on top of the Micron hybrid memory cube (HMC), and show that SSAM can achieve up to two orders of magnitude area-normalized throughput and energy efficiency improvement over multicore CPUs; we also show SSAM is faster and more energy efficient than competing GPUs and FPGAs. Finally, we show that SSAM is also useful for other data intensive tasks like kNN index construction, and can be generalized to semantically function as a high capacity content addressable memory.

Submitted 10 July, 2017; v1 submitted 12 June, 2016; originally announced June 2016.

Comments: 15 pages, 8 figures, 7 tables

arXiv:1510.03955 [pdf, other]

SAP: an Architecture for Selectively Approximate Wireless Communication

Authors: Benjamin Ransford, Luis Ceze

Abstract: Integrity checking is ubiquitous in data networks, but not all network traffic needs integrity protection. Many applications can tolerate slightly damaged data while still working acceptably, trading accuracy for efficiency to save time and energy. Such applications should be able to receive damaged data if they so desire. In today's network stacks, lower-layer integrity checks discard damaged data regardless of the application's wishes, violating the End-to-End Principle. This paper argues for optional integrity checking and gently redesigns a commodity network architecture to support integrity-unprotected data. Our scheme, called Selective Approximate Protocol (SAP), allows applications to coordinate multiple network layers to accept potentially damaged data. Unlike previous schemes that targeted video or media streaming, SAP is generic. SAP's improved throughput and decreased retransmission rate are a good match for applications in the domain of approximate computing. Implemented atop WiFi as a case study, SAP works with existing physical layers and requires no hardware changes. SAP's benefits increase as channel conditions degrade. In tests of an error-tolerant file-transfer application over WiFi, SAP sped up transmission by about 30% on average.

Submitted 13 October, 2015; originally announced October 2015.

arXiv:1103.6114 [pdf, ps, other]

The Impact of Memory Models on Software Reliability in Multiprocessors

Authors: Alexander Jaffe, Thomas Moscibroda, Laura Effinger-Dean, Luis Ceze, Karin Strauss

Abstract: The memory consistency model is a fundamental system property characterizing a multiprocessor. The relative merits of strict versus relaxed memory models have been widely debated in terms of their impact on performance, hardware complexity and programmability. This paper adds a new dimension to this discussion: the impact of memory models on software reliability. By allowing some instructions to reorder, weak memory models may expand the window between critical memory operations. This can increase the chance of an undesirable thread-interleaving, thus allowing an otherwise-unlikely concurrency bug to manifest. To explore this phenomenon, we define and study a probabilistic model of shared-memory parallel programs that takes into account such reordering. We use this model to formally derive bounds on the vulnerability to concurrency bugs of different memory models. Our results show that for 2 (or a small constant number of) concurrent threads, weaker memory models do indeed have a higher likelihood of allowing bugs. On the other hand, we show that as the number of parallel threads increases, the gap between the different memory models becomes proportionally insignificant. This suggests the counter-intuitive rule that as the number of parallel threads in the system increases, the importance of using a strict memory model diminishes, which potentially has major implications for the choice of memory consistency models in future multi-core systems.

Submitted 6 April, 2011; v1 submitted 31 March, 2011; originally announced March 2011.

Comments: 15 pages, 2 figures, conference

Showing 1–35 of 35 results for author: Ceze, L