Search | arXiv e-print repository

DeepMachining: Online Prediction of Machining Errors of Lathe Machines

Authors: Xiang-Li Lu, Hwai-Jung Hsu, Che-Wei Chou, H. T. Kung, Chen-Hsin Lee, Sheng-Mao Cheng

Abstract: We describe DeepMachining, a deep learning-based AI system for online prediction of machining errors of lathe machine operations. We have built and evaluated DeepMachining based on manufacturing data from factories. Specifically, we first pretrain a deep learning model for a given lathe machine's operations to learn the salient features of machining states. Then, we fine-tune the pretrained model to adapt to specific machining tasks. We demonstrate that DeepMachining achieves high prediction accuracy for multiple tasks that involve different workpieces and cutting tools. To the best of our knowledge, this work is one of the first factory experiments using pre-trained deep-learning models to predict machining errors of lathe machines.

Submitted 28 March, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

arXiv:2402.15504 [pdf, other]

Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition

Authors: Chun-Hsiao Yeh, Ta-Ying Cheng, He-Yen Hsieh, Chuan-En Lin, Yi Ma, Andrew Markham, Niki Trigoni, H. T. Kung, Yubei Chen

Abstract: Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models. First, current personalization techniques fail to reliably extend to multiple concepts -- we hypothesize this to be due to the mismatch between complex scenes and simple text descriptions in the pre-training dataset (e.g., LAION). Second, given an image containing multiple personalized concepts, there lacks a holistic metric that evaluates performance on not just the degree of resemblance of personalized concepts, but also whether all concepts are present in the image and whether the image accurately reflects the overall text description. To address these issues, we introduce Gen4Gen, a semi-automated dataset creation pipeline utilizing generative models to combine personalized concepts into complex compositions along with text-descriptions. Using this, we create a dataset called MyCanvas, that can be used to benchmark the task of multi-concept personalization. In addition, we design a comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) for better quantifying the performance of multi-concept, personalized text-to-image diffusion methods. We provide a simple baseline built on top of Custom Diffusion with empirical prompting strategies for future researchers to evaluate on MyCanvas. We show that by improving data quality and prompting strategies, we can significantly increase multi-concept personalized image generation quality, without requiring any modifications to model architecture or training algorithms.

Submitted 23 February, 2024; originally announced February 2024.

Comments: Preprint; Project Page: https://danielchyeh.github.io/Gen4Gen/

arXiv:2307.03930 [pdf, other]

Rosko: Row Skipping Outer Products for Sparse Matrix Multiplication Kernels

Authors: Vikas Natesh, Andrew Sabot, H. T. Kung, Mark Ting

Abstract: We propose Rosko -- row skipping outer products -- for deriving sparse matrix multiplication (SpMM) kernels in reducing computation and memory access requirements of deep neural networks (DNNs). Rosko allows skipping of entire row computations during program execution with low sparsity-management overheads. We analytically derive sparse CPU kernels that adapt to given hardware characteristics to effectively utilize processor cores and minimize data movement without the need for auto-tuning or search space exploration. Rosko can be integrated with other outer product scheduling methods, allowing them to leverage row skipping by using Rosko's packing format to skip unnecessary computation. Rosko kernels outperform existing auto-tuning and search-based solutions as well as state-of-the-art vendor-optimized libraries on real hardware across a variety of neural network workloads. For matrices with sparsities ranging from 65% to 99.8% typically found in machine learning, Rosko kernels achieve up to a 6.5x runtime reduction on Intel and ARM CPUs.

Submitted 8 July, 2023; originally announced July 2023.

Comments: Rosko's CPU implementation can be found at https://github.com/vnatesh/Rosko

arXiv:2304.05544 [pdf, other]

MEMA Runtime Framework: Minimizing External Memory Accesses for TinyML on Microcontrollers

Authors: Andrew Sabot, Vikas Natesh, H. T. Kung, Wei-Te Ting

Abstract: We present the MEMA framework for the easy and quick derivation of efficient inference runtimes that minimize external memory accesses for matrix multiplication on TinyML systems. The framework accounts for hardware resource constraints and problem sizes in analytically determining optimized schedules and kernels that minimize memory accesses. MEMA provides a solution to a well-known problem in the current practice, that is, optimal schedules tend to be found only through a time consuming and heuristic search of a large scheduling space. We compare the performance of runtimes derived from MEMA to existing state-of-the-art libraries on ARM-based TinyML systems. For example, for neural network benchmarks on the ARM Cortex-M4, we achieve up to a 1.8x speedup and 44% energy reduction over CMSIS-NN.

Submitted 11 April, 2023; originally announced April 2023.

Comments: Accepted as a full paper by the TinyML Research Symposium 2023

arXiv:2301.01947 [pdf, ps, other]

StitchNet: Composing Neural Networks from Pre-Trained Fragments

Authors: Surat Teerapittayanon, Marcus Comiter, Brad McDanel, H. T. Kung

Abstract: We propose StitchNet, a novel neural network creation paradigm that stitches together fragments (one or more consecutive network layers) from multiple pre-trained neural networks. StitchNet allows the creation of high-performing neural networks without the large compute and data requirements needed under traditional model creation processes via backpropagation training. We leverage Centered Kernel Alignment (CKA) as a compatibility measure to efficiently guide the selection of these fragments in composing a network for a given task tailored to specific accuracy needs and computing resource constraints. We then show that these fragments can be stitched together to create neural networks with accuracy comparable to that of traditionally trained networks at a fraction of computing resource and data requirements. Finally, we explore a novel on-the-fly personalized model creation and inference application enabled by this new paradigm. The code is available at https://github.com/steerapi/stitchnet.

Submitted 23 September, 2023; v1 submitted 5 January, 2023; originally announced January 2023.
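
The StitchNet abstract above names Centered Kernel Alignment (CKA) as the compatibility measure between fragments. As a rough illustration only, the sketch below computes the widely used linear form of CKA between two activation matrices (rows are examples, columns are features). Whether StitchNet uses this exact variant is an assumption, and the function names are hypothetical, not taken from the StitchNet code.

```python
import math

def _center(mat):
    """Subtract the per-column mean so every feature has zero mean."""
    n = len(mat)
    means = [sum(row[j] for row in mat) / n for j in range(len(mat[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in mat]

def _cross_frob_sq(a, b):
    """Squared Frobenius norm of a^T b for two matrices with equal row counts."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += s * s
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered inputs.

    Assumes neither input is constant across examples (otherwise the
    denominator is zero).
    """
    xc, yc = _center(x), _center(y)
    denom = math.sqrt(_cross_frob_sq(xc, xc) * _cross_frob_sq(yc, yc))
    return _cross_frob_sq(yc, xc) / denom
```

Linear CKA is invariant to isotropic scaling and to orthogonal transformations of the features, so two layers whose activations differ only by a rescaling or a permutation of channels score 1.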

arXiv:2209.12127 [pdf, other]

SpeedLimit: Neural Architecture Search for Quantized Transformer Models

Authors: Yuji Chai, Luke Bailey, Yunho Jin, Matthew Karle, Glenn G. Ko, David Brooks, Gu-Yeon Wei, H. T. Kung

Abstract: While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.

Submitted 13 October, 2023; v1 submitted 24 September, 2022; originally announced September 2022.

arXiv:2207.09413 [pdf, other]

SphereFed: Hyperspherical Federated Learning

Authors: Xin Dong, Sai Qian Zhang, Ang Li, H. T. Kung

Abstract: Federated Learning aims at training a global model from multiple decentralized devices (i.e. clients) without exchanging their private local data. A key challenge is the handling of non-i.i.d. (independent identically distributed) data across multiple clients that may induce disparities of their local features. We introduce the Hyperspherical Federated Learning (SphereFed) framework to address the non-i.i.d. issue by constraining learned representations of data points to be on a unit hypersphere shared by clients. Specifically, all clients learn their local representations by minimizing the loss with respect to a fixed classifier whose weights span the unit hypersphere. After federated training in improving the global model, this classifier is further calibrated with a closed-form solution by minimizing a mean squared loss. We show that the calibration solution can be computed efficiently and distributedly without direct access of local data. Extensive experiments indicate that our SphereFed approach is able to improve the accuracy of multiple existing federated learning algorithms by a considerable margin (up to 6% on challenging datasets) with enhanced computation and communication efficiency across datasets and model architectures.

Submitted 19 July, 2022; originally announced July 2022.

Comments: European Conference on Computer Vision 2022

arXiv:2204.04705 [pdf, other]

SplitNets: Designing Neural Architectures for Efficient Distributed Computing on Head-Mounted Systems

Authors: Xin Dong, Barbara De Salvo, Meng Li, Chiao Liu, Zhongnan Qu, H. T. Kung, Ziyun Li

Abstract: We design deep neural networks (DNNs) and corresponding networks' splittings to distribute DNNs' workload to camera sensors and a centralized aggregator on head mounted devices to meet system performance targets in inference accuracy and latency under the given hardware resource constraints. To achieve an optimal balance among computation, communication, and performance, a split-aware neural architecture search framework, SplitNets, is introduced to conduct model designing, splitting, and communication reduction simultaneously. We further extend the framework to multi-view systems for learning to fuse inputs from multiple camera sensors with optimal performance and systemic efficiency. We validate SplitNets for single-view system on ImageNet as well as multi-view system on 3D classification, and show that the SplitNets framework achieves state-of-the-art (SOTA) performance and system latency compared with existing approaches.

Submitted 10 April, 2022; originally announced April 2022.

Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022

arXiv:2110.15456 [pdf, other]

FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding

Authors: Sai Qian Zhang, Bradley McDanel, H. T. Kung

Abstract: Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training by providing a wide dynamic range via a shared exponent across a group of values. In this paper, we propose a Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP. FAST supports matrix multiplication with variable precision BFP input operands, enabling incremental increases in DNN precision throughout training. By increasing the BFP precision across both training iterations and DNN layers, FAST can greatly shorten the training time while reducing overall hardware resource usage. Our FAST Multiplier-Accumulator (fMAC) supports dot product computations under multiple BFP precisions. We validate our FAST system on multiple DNNs with different datasets, demonstrating a 2-6x speedup in training on a single-chip platform over prior work based on mixed-precision or block floating point number systems while achieving similar performance in validation accuracy.

Submitted 28 October, 2021; originally announced October 2021.
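
The FAST abstract above relies on Block Floating Point, where a group of values shares one exponent and each value keeps only a short mantissa. The sketch below illustrates that idea under simplifying assumptions: it uses round-to-nearest rather than the stochastic rounding named in the paper's title, and the function name and the signed-mantissa layout are hypothetical choices made for illustration, not the paper's format.

```python
import math

def bfp_quantize(values, mantissa_bits):
    """Quantize a group of floats to a shared-exponent format.

    Returns (shared_exponent, integer_mantissas, dequantized_values).
    mantissa_bits counts the sign bit, so magnitudes use mantissa_bits - 1 bits.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values), [0.0] * len(values)
    # frexp writes max_abs as m * 2**e with 0.5 <= m < 1; e becomes the shared exponent.
    _, shared_exp = math.frexp(max_abs)
    # Size of one mantissa step for the whole group.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to nearest (assumption: not stochastic) and clamp to the mantissa range.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return shared_exp, mantissas, [m * scale for m in mantissas]
```

For the group [0.5, -0.25, 0.125, 0.0], a 4-bit mantissa represents every value exactly, while a 2-bit mantissa keeps only the largest value and flushes the rest to zero. This is the kind of precision trade-off that variable-precision training adjusts over time.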

arXiv:2107.06304 [pdf, other]

Privacy Vulnerability of Split Computing to Data-Free Model Inversion Attacks

Authors: Xin Dong, Hongxu Yin, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov, H. T. Kung

Abstract: Mobile edge devices see increased demands in deep neural networks (DNNs) inference while suffering from stringent constraints in computing resources. Split computing (SC) emerges as a popular approach to the issue by executing only initial layers on devices and offloading the remaining to the cloud. Prior works usually assume that SC offers privacy benefits as only intermediate features, instead of private data, are shared from devices to the cloud. In this work, we debunk this SC-induced privacy protection by (i) presenting a novel data-free model inversion method and (ii) demonstrating sample inversion where private data from devices can still be leaked with high fidelity from the shared feature even after tens of neural network layers. We propose Divide-and-Conquer Inversion (DCI) which partitions the given deep network into multiple shallow blocks and inverts each block with an inversion method. Additionally, cycle-consistency technique is introduced by re-directing the inverted results back to the model under attack in order to better supervise the training of the inversion modules. In contrast to prior art based on generative priors and computation-intensive optimization in deriving inverted samples, DCI removes the need for real device data and generative priors, and completes inversion with a single quick forward pass over inversion modules. For the first time, we scale data-free and sample-specific inversion to deep architectures and large datasets for both discriminative and generative networks. We perform model inversion attack to ResNet and RepVGG models on ImageNet and SNGAN on CelebA and recover the original input from intermediate features more than 40 layers deep into the network.

Submitted 24 October, 2022; v1 submitted 13 July, 2021; originally announced July 2021.

Comments: A new data-free inversion method to reverse neural networks and get input from intermediate feature maps. BMVC'22

arXiv:2107.05455 [pdf, ps, other]

A Local Diagnosis Algorithm for Hypercube-like Networks under the BGM Diagnosis Model

Authors: Cheng-Kuan Lin, Tzu-Liang Kung, Chun-Nan Hung, Yuan-Hsiang Teng

Abstract: System diagnosis is the process of identifying faulty nodes in a system. An efficient diagnosis is crucial for a multiprocessor system. The BGM diagnosis model is a modification of the PMC diagnosis model, which is a test-based diagnosis. In this paper, we present a specific structure and propose an algorithm for diagnosing a node in a system under the BGM model. We also give a polynomial-time algorithm by which a node in a hypercube-like network can be diagnosed correctly in three test rounds under the BGM diagnosis model.

Submitted 8 June, 2022; v1 submitted 30 June, 2021; originally announced July 2021.

Journal ref: Fundamenta Informaticae, Volume 185, Issue 4 (July 7, 2022) fi:7674

arXiv:2104.11408 [pdf, other]

Neural Mean Discrepancy for Efficient Out-of-Distribution Detection

Authors: Xin Dong, Junfeng Guo, Ang Li, Wei-Te Ting, Cong Liu, H. T. Kung

Abstract: Various approaches have been proposed for out-of-distribution (OOD) detection by augmenting models, input examples, training sets, and optimization objectives. Deviating from existing work, we have a simple hypothesis that standard off-the-shelf models may already contain sufficient information about the training set distribution which can be leveraged for reliable OOD detection. Our empirical study on validating this hypothesis, which measures the model activation's mean for OOD and in-distribution (ID) mini-batches, surprisingly finds that activation means of OOD mini-batches consistently deviate more from those of the training data. In addition, training data's activation means can be computed offline efficiently or retrieved from batch normalization layers as a 'free lunch'. Based upon this observation, we propose a novel metric called Neural Mean Discrepancy (NMD), which compares neural means of the input examples and training data. Leveraging the simplicity of NMD, we propose an efficient OOD detector that computes neural means by a standard forward pass followed by a lightweight classifier. Extensive experiments show that NMD outperforms state-of-the-art OOD approaches across multiple datasets and model architectures in terms of both detection accuracy and computational cost.

Submitted 26 March, 2022; v1 submitted 23 April, 2021; originally announced April 2021.

Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022
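
The Neural Mean Discrepancy abstract above compares the activation means of input examples with those of the training data. The sketch below shows that comparison for a single layer, with activations given as plain lists of per-channel values. The function names are hypothetical, and the final L2-norm score is only a stand-in assumption for the lightweight classifier the abstract mentions.

```python
import math

def channel_means(activations):
    """Mean activation per channel over a batch (one list of channel values per example)."""
    n = len(activations)
    channels = len(activations[0])
    return [sum(example[c] for example in activations) / n for c in range(channels)]

def neural_mean_discrepancy(batch_activations, training_means):
    """Per-channel difference between the batch means and the stored training means."""
    batch_means = channel_means(batch_activations)
    return [b - t for b, t in zip(batch_means, training_means)]

def discrepancy_score(nmd_vector):
    """Assumed scoring rule: L2 norm of the discrepancy vector (larger suggests OOD)."""
    return math.sqrt(sum(d * d for d in nmd_vector))
```

In practice the training means would be computed offline or read from batch normalization statistics, as the abstract notes, and the discrepancy vector would feed a trained detector instead of a fixed norm.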

arXiv:2007.06389 [pdf, other]

Term Revealing: Furthering Quantization at Run Time on Quantized DNNs

Authors: H. T. Kung, Bradley McDanel, Sai Qian Zhang

Abstract: We present a novel technique, called Term Revealing (TR), for furthering quantization at run time for improved performance of Deep Neural Networks (DNNs) already quantized with conventional quantization methods. TR operates on power-of-two terms in binary expressions of values. In computing a dot product, TR dynamically selects a fixed number of largest terms to use from the values of the two vectors in the dot product. By exploiting normal-like weight and data distributions typically present in DNNs, TR has a minimal impact on DNN model performance (i.e., accuracy or perplexity). We use TR to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We show an FPGA implementation that can use a small number of control bits to switch between conventional quantization and TR-enabled quantization with a negligible delay. To enhance TR efficiency further, we use a signed digit representation (SDR), as opposed to classic binary encoding with only nonnegative power-of-two terms. To perform conversion from binary to SDR, we develop an efficient encoding method called HESE (Hybrid Encoding for Signed Expressions) that can be performed in one pass looking at only two bits at a time. We evaluate TR with HESE encoded values on an MLP for MNIST, multiple CNNs for ImageNet, and an LSTM for Wikitext-2, and show significant reductions in inference computations (between 3-10x) compared to conventional quantization for the same level of model performance.

Submitted 26 July, 2020; v1 submitted 13 July, 2020; originally announced July 2020.

Comments: 13 pages, 19 figures, 4 tables, To appear in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020 Update: Revised writing/figures and added more references for Section IV Update: Revised Section IV writing/figures and added additional references on signed digit representations

arXiv:1907.08377 [pdf, other]

DaiMoN: A Decentralized Artificial Intelligence Model Network

Authors: Surat Teerapittayanon, H. T. Kung

Abstract: We introduce DaiMoN, a decentralized artificial intelligence model network, which incentivizes peer collaboration in improving the accuracy of machine learning models for a given classification problem. It is an autonomous network where peers may submit models with improved accuracy and other peers may verify the accuracy improvement. The system maintains an append-only decentralized ledger to keep the log of critical information, including who has trained the model and improved its accuracy, when it has been improved, by how much it has improved, and where to find the newly updated model. DaiMoN rewards these contributing peers with cryptographic tokens. A main feature of DaiMoN is that it allows peers to verify the accuracy improvement of submitted models without knowing the test labels. This is an essential component in order to mitigate intentional model overfitting by model-improving peers. To enable this model accuracy evaluation with hidden test labels, DaiMoN uses a novel learnable Distance Embedding for Labels (DEL) function proposed in this paper. Specific to each test dataset, DEL scrambles the test label vector by embedding it in a low-dimension space while approximately preserving the distance between the dataset's test label vector and a label vector inferred by the classifier. It therefore allows proof-of-improvement (PoI) by peers without providing them access to true test labels. We provide analysis and empirical evidence that under DEL, peers can accurately assess model accuracy. We also argue that it is hard to invert the embedding function and thus, DEL is resilient against attacks aiming to recover test labels in order to cheat. Our prototype implementation of DaiMoN is available at https://github.com/steerapi/daimon.

Submitted 19 July, 2019; originally announced July 2019.

Comments: 2019 IEEE International Conference on Blockchain

arXiv:1906.07148 [pdf, other]

CheckNet: Secure Inference on Untrusted Devices

Authors: Marcus Comiter, Surat Teerapittayanon, H. T. Kung

Abstract: We introduce CheckNet, a method for secure inference with deep neural networks on untrusted devices. CheckNet is like a checksum for neural network inference: it verifies the integrity of the inference computation performed by untrusted devices to 1) ensure the inference has actually been performed, and 2) ensure the inference has not been manipulated by an attacker. CheckNet is completely transparent to the third party running the computation, applicable to all types of neural networks, does not require specialized hardware, adds little overhead, and has negligible impact on model performance. CheckNet can be configured to provide different levels of security depending on application needs and compute/communication budgets. We present both empirical and theoretical validation of CheckNet on multiple popular deep neural network models, showing excellent attack detection (0.88-0.99 AUC) and attack success bounds.

Submitted 17 June, 2019; originally announced June 2019.

arXiv:1905.00462 [pdf, other]

Full-stack Optimization for Accelerating CNNs with FPGA Validation

Authors: Bradley McDanel, Sai Qian Zhang, H. T. Kung, Xin Dong

Abstract: We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with field-programmable gate array (FPGA) implementations. By jointly optimizing CNN models, computing architectures, and hardware implementations, our full-stack approach achieves unprecedented performance in the trade-off space characterized by inference latency, energy efficiency, hardware utilization and inference accuracy. As a validation vehicle, we have implemented a 170MHz FPGA inference chip achieving 2.28ms latency for the ImageNet benchmark. The achieved latency is among the lowest reported in the literature while achieving comparable accuracy. However, our chip shines in that it has 9x higher energy efficiency compared to other implementations achieving comparable latency. A highlight of our full-stack approach which contributes to the achieved high energy efficiency is an efficient Selector-Accumulator (SAC) architecture for implementing the multiplier-accumulator (MAC) operation present in any digital CNN hardware. For instance, compared to an FPGA implementation for a traditional 8-bit MAC, SAC substantially reduces required hardware resources (4.85x fewer Look-up Tables) and power consumption (2.48x).

Submitted 1 May, 2019; originally announced May 2019.

arXiv:1812.05083 [pdf, other]

Adversarial Learning of Semantic Relevance in Text to Image Synthesis

Authors: Miriam Cha, Youngjune L. Gwon, H. T. Kung

Abstract: We describe a new approach that improves the training of generative adversarial nets (GANs) for synthesizing diverse images from a text input. Our approach is based on the conditional version of GANs and expands on previous work leveraging an auxiliary task in the discriminator. Our generated images are not limited to certain classes and do not suffer from mode collapse while semantically matching the text input. A key to our training methods is how to form positive and negative training examples with respect to the class label of a given image. Instead of selecting random training examples, we perform negative sampling based on the semantic distance from a positive example in the class. We evaluate our approach using the Oxford-102 flower dataset, adopting the inception score and multi-scale structural similarity index (MS-SSIM) metrics to assess discriminability and diversity of the generated images. The empirical results indicate greater diversity in the generated images, especially when we gradually select more negative training examples closer to a positive example in the semantic space.

Submitted 5 February, 2019; v1 submitted 12 December, 2018; originally announced December 2018.

arXiv:1811.04770 [pdf, other]

Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization

Authors: H. T. Kung, Bradley McDanel, Sai Qian Zhang

Abstract: This paper describes a novel approach of packing sparse convolutional neural networks for their efficient systolic array implementations. By combining subsets of columns in the original filter matrix associated with a convolutional layer, we increase the utilization efficiency of the systolic array substantially (e.g., ~4x) due to the increased density of nonzeros in the resulting packed filter matrix. In combining columns, for each row, all filter weights but the one with the largest magnitude are pruned. We retrain the remaining weights to preserve high accuracy. We demonstrate that, in mitigating data privacy concerns, the retraining can be accomplished with only fractions of the original dataset (e.g., 10% for CIFAR-10). We study the effectiveness of this joint optimization for both high utilization and classification accuracy with ASIC and FPGA designs based on efficient bit-serial implementations of multiplier-accumulators. We present analysis and empirical evidence on the superior performance of our column combining approach against prior art under metrics such as energy efficiency (3x) and inference latency (12x).

Submitted 7 November, 2018; originally announced November 2018.

Comments: To appear in ASPLOS 2019

arXiv:1802.03373 [pdf, other]

InferBeam: A Fast Beam Alignment Protocol for Millimeter-wave Networking

Authors: Sai Qian Zhang, H. T. Kung, Youngjune Gwon

Abstract: We introduce fast selection of a millimeter-wave base station (BS) and its antenna sector for user equipment based on its location. Using a conditional random field inference model with specially designed parameters, which are robust to changes in the environment, InferBeam allows the use of measurement samples on best beam selection at a small number of locations to infer the rest dynamically. Compared to beam-sweeping based approaches in the literature, InferBeam can drastically reduce the setup cost for beam alignment for a new environment, and also the latency in acquiring a new beam under intermittent blockage. We have evaluated InferBeam using a discrete event simulation. Our results indicate that the system can make the best beam selection for 98% of locations in test environments comprising small-sized apartment or office spaces, while sampling fewer than 1% of locations. InferBeam is a complete protocol for best beam inference that can be integrated into millimeter-wave standards for accelerating the much-needed fast and economic beam alignment capability.

Submitted 5 March, 2018; v1 submitted 9 February, 2018; originally announced February 2018.

arXiv:1710.07830 [pdf, other]

Incomplete Dot Products for Dynamic Computation Scaling in Neural Network Inference

Authors: Bradley McDanel, Surat Teerapittayanon, H. T. Kung

Abstract: We propose the use of incomplete dot products (IDP) to dynamically adjust the number of input channels used in each layer of a convolutional neural network during feedforward inference. IDP adds monotonically non-increasing coefficients, referred to as a "profile", to the channels during training. The profile orders the contribution of each channel in non-increasing order. At inference time, the number of channels used can be dynamically adjusted to trade off accuracy for lowered power consumption and reduced latency by selecting only a beginning subset of channels. This approach allows for a single network to dynamically scale over a computation range, as opposed to training and deploying multiple networks to support different levels of computation scaling. Additionally, we extend the notion to multiple profiles, each optimized for some specific range of computation scaling. We present experiments on the computation and accuracy trade-offs of IDP for popular image classification models and datasets. We demonstrate that, for MNIST and CIFAR-10, IDP reduces computation significantly, e.g., by 75%, without significantly compromising accuracy. We argue that IDP provides a convenient and effective means for devices to lower computation costs dynamically to reflect the current computation budget of the system. For example, VGG-16 with 50% IDP (using only the first 50% of channels) achieves 70% accuracy on the CIFAR-10 dataset, compared to the standard network, which achieves only 35% accuracy when using the reduced channel set.

Submitted 21 October, 2017; originally announced October 2017.

arXiv:1709.02260 [pdf, other]

Embedded Binarized Neural Networks

Authors: Bradley McDanel, Surat Teerapittayanon, H. T. Kung

Abstract: We study embedded Binarized Neural Networks (eBNNs) with the aim of allowing current binarized neural networks (BNNs) in the literature to perform feedforward inference efficiently on small embedded devices. We focus on minimizing the required memory footprint, given that these devices often have memory as small as tens of kilobytes (KB). Beyond minimizing the memory required to store weights, as in a BNN, we show that it is essential to minimize the memory used for temporaries which hold intermediate results between layers in feedforward inference. To accomplish this, eBNN reorders the computation of inference while preserving the original BNN structure, and uses just a single floating-point temporary for the entire neural network. All intermediate results from a layer are stored as binary values, as opposed to the floating-point values used in current BNN implementations, leading to a 32x reduction in required temporary space. We provide empirical evidence that our proposed eBNN approach allows efficient inference (10s of ms) on devices with severely limited memory (10s of KB). For example, eBNN achieves 95% accuracy on the MNIST dataset running on an Intel Curie with only 15 KB of usable memory with an inference runtime of under 50 ms per sample. To ease the development of applications in embedded contexts, we make our source code available that allows users to train and discover eBNN models for a learning task at hand, which fit within the memory constraint of the target device.

Submitted 6 September, 2017; originally announced September 2017.

arXiv:1709.01921 [pdf, other]

Distributed Deep Neural Networks over the Cloud, the Edge and End Devices

Authors: Surat Teerapittayanon, Bradley McDanel, H. T. Kung

Abstract: We propose distributed deep neural networks (DDNNs) over distributed computing hierarchies, consisting of the cloud, the edge (fog) and end devices. While being able to accommodate inference of a deep neural network (DNN) in the cloud, a DDNN also allows fast and localized inference using shallow portions of the neural network at the edge and end devices. When supported by a scalable distributed computing hierarchy, a DDNN can scale up in neural network size and scale out in geographical span. Due to their distributed nature, DDNNs enhance sensor fusion, system fault tolerance and data privacy for DNN applications. In implementing a DDNN, we map sections of a DNN onto a distributed computing hierarchy. By jointly training these sections, we minimize communication and resource usage for devices and maximize the usefulness of extracted features which are utilized in the cloud. The resulting system has built-in support for automatic sensor fusion and fault tolerance. As a proof of concept, we show a DDNN can exploit geographical diversity of sensors to improve object recognition accuracy and reduce communication cost. In our experiment, compared with the traditional method of offloading raw sensor data to be processed in the cloud, DDNN locally processes most sensor data on end devices while achieving high accuracy and is able to reduce the communication cost by a factor of over 20x.

Submitted 6 September, 2017; originally announced September 2017.

arXiv:1709.01888 [pdf, other]

Language Modeling by Clustering with Word Embeddings for Text Readability Assessment

Authors: Miriam Cha, Youngjune Gwon, H. T. Kung

Abstract: We present a clustering-based language model using word embeddings for text readability prediction. Presumably, a Euclidean semantic space hypothesis holds true for word embeddings whose training is done by observing word co-occurrences. We argue that clustering with word embeddings in the metric space should yield feature representations in a higher semantic space appropriate for text regression. Also, by representing features in terms of histograms, our approach can naturally address documents of varying lengths. An empirical evaluation using the Common Core Standards corpus reveals that the features formed on our clustering-based language model significantly improve the previously known results for the same corpus in readability prediction. We also evaluate the task of sentence matching based on semantic relatedness using the Wiki-SimpleWiki corpus and find that our features lead to superior matching performance.

Submitted 4 September, 2017; originally announced September 2017.

arXiv:1709.01686 [pdf, other]

BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks

Authors: Surat Teerapittayanon, Bradley McDanel, H. T. Kung

Abstract: Deep neural networks are state-of-the-art methods for many learning tasks due to their ability to extract increasingly better features at each network layer. However, the improved performance of additional layers in a deep network comes at the cost of added latency and energy usage in feedforward inference. As networks continue to get deeper and larger, these costs become more prohibitive for real-time and energy-sensitive applications. To address this issue, we present BranchyNet, a novel deep network architecture that is augmented with additional side branch classifiers. The architecture allows prediction results for a large portion of test samples to exit the network early via these branches when samples can already be inferred with high confidence. BranchyNet exploits the observation that features learned at an early layer of a network may often be sufficient for the classification of many data points. For more difficult samples, which are expected less frequently, BranchyNet will use further or all network layers to provide the best likelihood of correct prediction. We study the BranchyNet architecture using several well-known networks (LeNet, AlexNet, ResNet) and datasets (MNIST, CIFAR10) and show that it can both improve accuracy and significantly reduce the inference time of the network.

Submitted 6 September, 2017; originally announced September 2017.

arXiv:1708.09321 [pdf, other]

Adversarial nets with perceptual losses for text-to-image synthesis

Authors: Miriam Cha, Youngjune Gwon, H. T. Kung

Abstract: Recent approaches in generative adversarial networks (GANs) can automatically synthesize realistic images from descriptive text. Despite the overall fair quality, the generated images often expose visible flaws that lack structural definition for an object of interest. In this paper, we aim to extend the state of the art for GAN-based text-to-image synthesis by improving the perceptual quality of generated images. Differentiated from previous work, our synthetic image generator optimizes on perceptual loss functions that measure pixel, feature activation, and texture differences against a natural image. We present visually more compelling synthetic images of birds and flowers generated from text descriptions in comparison to some of the most prominent existing work.

Submitted 30 August, 2017; originally announced August 2017.

arXiv:1605.05212 [pdf, other]

Multimodal Sparse Coding for Event Detection

Authors: Youngjune Gwon, William Campbell, Kevin Brady, Douglas Sturim, Miriam Cha, H. T. Kung

Abstract: Unsupervised feature learning methods have proven effective for classification tasks based on a single modality. We present multimodal sparse coding for learning feature representations shared across multiple modalities. The shared representations are applied to multimedia event detection (MED) and evaluated in comparison to unimodal counterparts, as well as other feature learning methods such as GMM supervectors and sparse RBM. We report the cross-validated classification accuracy and mean average precision of the MED system trained on features learned from our unimodal and multimodal settings for a subset of the TRECVID MED 2014 dataset.

Submitted 17 May, 2016; originally announced May 2016.

Comments: Multimodal Machine Learning Workshop at NIPS 2015

arXiv:1511.06238 [pdf, other]

Multimodal sparse representation learning and applications

Authors: Miriam Cha, Youngjune Gwon, H. T. Kung

Abstract: Unsupervised methods have proven effective for discriminative tasks in a single-modality scenario. In this paper, we present a multimodal framework for learning sparse representations that can capture semantic correlation between modalities. The framework can model relationships at a higher level by forcing the shared sparse representation. In particular, we propose the use of a joint dictionary learning technique for sparse coding and formulate the joint representation for concision, cross-modal representations (in case of a missing modality), and union of the cross-modal representations. Given the accelerated growth of multimodal data posted on the Web such as YouTube, Wikipedia, and Twitter, learning good multimodal features is becoming increasingly important. We show that the shared representations enabled by our framework substantially improve the classification performance under both unimodal and multimodal settings. We further show how deep architectures built on the proposed framework are effective for the case of highly nonlinear correlations between modalities. The effectiveness of our approach is demonstrated experimentally in image denoising, multimedia event detection and retrieval on the TRECVID dataset (audio-video), category classification on the Wikipedia dataset (image-text), and sentiment classification on PhotoTweet (image-text).

Submitted 2 March, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

arXiv:1212.2894 [pdf, other]

Reducing Reconciliation Communication Cost with Compressed Sensing

Authors: H. T. Kung, Chia-Mu Yu

Abstract: We consider a reconciliation problem, where two hosts wish to synchronize their respective sets. Efficient solutions for minimizing the communication cost between the two hosts have been previously proposed in the literature. However, they rely on prior knowledge about the size of the set differences between the two sets to be reconciled. In this paper, we propose a method which can achieve comparable efficiency without assuming this prior knowledge. Our method uses compressive sensing techniques which can leverage the expected sparsity in set differences. We study the performance of the method via theoretical analysis and numerical simulations.

Submitted 4 December, 2012; originally announced December 2012.

Comments: 4 pages, 2 figures

arXiv:1004.3716 [pdf, ps, other]

Some linear-time algorithms for systolic arrays

Authors: Richard P. Brent, Franklin T. Luk, H. T. Kung

Abstract: We survey some results on linear-time algorithms for systolic arrays. In particular, we show how the greatest common divisor (GCD) of two polynomials of degree n over a finite field can be computed in time O(n) on a linear systolic array of O(n) cells; similarly for the GCD of two n-bit binary numbers. We show how n × n Toeplitz systems of linear equations can be solved in time O(n) on a linear array of O(n) cells, each of which has constant memory size (independent of n). Finally, we outline how a two-dimensional square array of O(n) × O(n) cells can be used to solve (to working accuracy) the eigenvalue problem for a symmetric real n × n matrix in time O(nS(n)). Here S(n) is a slowly growing function of n; for practical purposes S(n) can be regarded as a constant. In addition to their theoretical interest, these results have potential applications in the areas of error-correcting codes, symbolic and algebraic computations, signal processing and image processing.

Submitted 21 April, 2010; originally announced April 2010.

Comments: Corrected version of an old (1983) paper. 23 pages. For further details, see http://wwwmaths.anu.edu.au/~brent/pub/pub079.html

Report number: Report TR-CS-82-15, DCS, Australian National University, December 1982

MSC Class: 65Y05 (Primary) 37B15; 68Q10; 68Q80 (Secondary)

ACM Class: G.1.3; B.6.1; C.1.3

Journal ref: Information Processing 83 (edited by R.E.A. Mason), North-Holland, Amsterdam, 1983, 865-876

arXiv:cs/9811028 [pdf, ps]

TCP Trunking

Authors: H. T. Kung, S. Y. Wang

Abstract: A TCP trunk is an IP tunnel under TCP control, capable of carrying packets from any number of user flows. By exploiting properties of TCP, a TCP trunk provides elastic and reliable transmission over a network, and automatically shares the network fairly with other competing trunks. Moreover, by aggregating user flows into a single trunk flow, TCP trunking can significantly reduce the number of flows that the network needs to manage, thereby allowing use of simplified management to achieve improved performance. For example, when dealing with only a small number of TCP trunk flows, a router with a simple FIFO buffer can experience low packet loss rates. A TCP trunk is a "soft" circuit in the sense that it requires no flow states to be maintained inside the network. Setting up a TCP trunk involves only configuring the two end nodes. This is in contrast with traditional methods of configuring circuits via signaling of network nodes. A simple packet-dropping mechanism based on packet accounting at the transmitter of a TCP trunk assures that, when the trunk reduces its bandwidth in response to network congestion, user TCP flows carried by the trunk will reduce their bandwidths by the same proportion. Simulation results have demonstrated that TCP trunks can provide improved network performance to users, while achieving high network utilization.

Submitted 20 November, 1998; originally announced November 1998.

Comments: postscript file

ACM Class: C.2.1

Showing 1–30 of 30 results for author: Kung, T