Search | arXiv e-print repository

DεpS: Delayed ε-Shrinking for Faster Once-For-All Training

Authors: Aditya Annavajjala, Alind Khare, Animesh Agrawal, Igor Fedorov, Hugo Latapie, Myungjin Lee, Alexey Tumanov

Abstract: CNNs are increasingly deployed across different hardware, dynamic environments, and low-power embedded devices. This has led to the design and training of CNN architectures with the goal of maximizing accuracy subject to such variable deployment constraints. As the number of deployment scenarios grows, there is a need to find scalable solutions to design and train specialized CNNs. Once-for-all tr… ▽ More CNNs are increasingly deployed across different hardware, dynamic environments, and low-power embedded devices. This has led to the design and training of CNN architectures with the goal of maximizing accuracy subject to such variable deployment constraints. As the number of deployment scenarios grows, there is a need to find scalable solutions to design and train specialized CNNs. Once-for-all training has emerged as a scalable approach that jointly co-trains many models (subnets) at once with a constant training cost and finds specialized CNNs later. The scalability is achieved by training the full model and simultaneously reducing it to smaller subnets that share model weights (weight-shared shrinking). However, existing once-for-all training approaches incur huge training costs reaching 1200 GPU hours. We argue this is because they either start the process of shrinking the full model too early or too late. Hence, we propose Delayed $ε$-Shrinking (D$ε$pS) that starts the process of shrinking the full model when it is partially trained (~50%) which leads to training cost improvement and better in-place knowledge distillation to smaller models. The proposed approach also consists of novel heuristics that dynamically adjust subnet learning rates incrementally (E), leading to improved weight-shared knowledge distillation from larger to smaller subnets as well. As a result, DEpS outperforms state-of-the-art once-for-all training techniques across different datasets including CIFAR10/100, ImageNet-100, and ImageNet-1k on accuracy and cost. It achieves 1.83% higher ImageNet-1k top1 accuracy or the same accuracy with 1.3x reduction in FLOPs and 2.5x drop in training cost (GPU*hrs) △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: Accepted to the 18th European Conference on Computer Vision (ECCV 2024)

arXiv:2405.16406 [pdf, other]

SpinQuant: LLM quantization with learned rotations

Authors: Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

Abstract: Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Recent findings suggest that rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a… ▽ More Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Recent findings suggest that rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures, and find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant that optimizes (or learns) the rotation matrices with Cayley optimization on a small validation set. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-2 7B/LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by 30.2%/34.1% relative to QuaRot. △ Less

Submitted 28 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

arXiv:2402.14905 [pdf, other]

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Authors: Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra

Abstract: This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our in… ▽ More This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight-sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% than MobileLLM 125M/350M. Moreover, MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases. △ Less

Submitted 26 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: ICML 2024. Code is available at https://github.com/facebookresearch/MobileLLM

arXiv:2311.13169 [pdf, other]

SiGeo: Sub-One-Shot NAS via Information Theory and Geometry of Loss Landscape

Authors: Hua Zheng, Kuang-Hung Liu, Igor Fedorov, Xin Zhang, Wen-Yen Chen, Wei Wen

Abstract: Neural Architecture Search (NAS) has become a widely used tool for automating neural network design. While one-shot NAS methods have successfully reduced computational requirements, they often require extensive training. On the other hand, zero-shot NAS utilizes training-free proxies to evaluate a candidate architecture's test performance but has two limitations: (1) inability to use the informati… ▽ More Neural Architecture Search (NAS) has become a widely used tool for automating neural network design. While one-shot NAS methods have successfully reduced computational requirements, they often require extensive training. On the other hand, zero-shot NAS utilizes training-free proxies to evaluate a candidate architecture's test performance but has two limitations: (1) inability to use the information gained as a network improves with training and (2) unreliable performance, particularly in complex domains like RecSys, due to the multi-modal data inputs and complex architecture configurations. To synthesize the benefits of both methods, we introduce a "sub-one-shot" paradigm that serves as a bridge between zero-shot and one-shot NAS. In sub-one-shot NAS, the supernet is trained using only a small subset of the training data, a phase we refer to as "warm-up." Within this framework, we present SiGeo, a proxy founded on a novel theoretical framework that connects the supernet warm-up with the efficacy of the proxy. Extensive experiments have shown that SiGeo, with the benefit of warm-up, consistently outperforms state-of-the-art NAS proxies on various established NAS benchmarks. When a supernet is warmed up, it can achieve comparable performance to weight-sharing one-shot NAS methods, but with a significant reduction ($\sim 60$\%) in computational costs. △ Less

Submitted 22 November, 2023; originally announced November 2023.

Comments: 24 pages, 7 figures

arXiv:2311.08430 [pdf, other]

Rankitect: Ranking Architecture Search Battling World-class Engineers at Meta Scale

Authors: Wei Wen, Kuang-Hung Liu, Igor Fedorov, Xin Zhang, Hang Yin, Weiwei Chu, Kaveh Hassani, Mengying Sun, Jiang Liu, Xu Wang, Lin Jiang, Yuxin Chen, Buyun Zhang, Xi Liu, Dehua Cheng, Zhengxing Chen, Guang Zhao, Fangqiu Han, Jiyan Yang, Yuchen Hao, Liang Xiong, Wen-Yen Chen

Abstract: Neural Architecture Search (NAS) has demonstrated its efficacy in computer vision and potential for ranking systems. However, prior work focused on academic problems, which are evaluated at small scale under well-controlled fixed baselines. In industry system, such as ranking system in Meta, it is unclear whether NAS algorithms from the literature can outperform production baselines because of: (1… ▽ More Neural Architecture Search (NAS) has demonstrated its efficacy in computer vision and potential for ranking systems. However, prior work focused on academic problems, which are evaluated at small scale under well-controlled fixed baselines. In industry system, such as ranking system in Meta, it is unclear whether NAS algorithms from the literature can outperform production baselines because of: (1) scale - Meta ranking systems serve billions of users, (2) strong baselines - the baselines are production models optimized by hundreds to thousands of world-class engineers for years since the rise of deep learning, (3) dynamic baselines - engineers may have established new and stronger baselines during NAS search, and (4) efficiency - the search pipeline must yield results quickly in alignment with the productionization life cycle. In this paper, we present Rankitect, a NAS software framework for ranking systems at Meta. Rankitect seeks to build brand new architectures by composing low level building blocks from scratch. Rankitect implements and improves state-of-the-art (SOTA) NAS methods for comprehensive and fair comparison under the same search space, including sampling-based NAS, one-shot NAS, and Differentiable NAS (DNAS). We evaluate Rankitect by comparing to multiple production ranking models at Meta. We find that Rankitect can discover new models from scratch achieving competitive tradeoff between Normalized Entropy loss and FLOPs. When utilizing search space designed by engineers, Rankitect can generate better models than engineers, achieving positive offline evaluation and online A/B test at Meta scale. △ Less

Submitted 13 November, 2023; originally announced November 2023.

Comments: Wei Wen and Kuang-Hung Liu contribute equally

arXiv:2311.00231 [pdf, other]

DistDNAS: Search Efficient Feature Interactions within 2 Hours

Authors: Tunhou Zhang, Wei Wen, Igor Fedorov, Xi Liu, Buyun Zhang, Fangqiu Han, Wen-Yen Chen, Yiping Han, Feng Yan, Hai Li, Yiran Chen

Abstract: Search efficiency and serving efficiency are two major axes in building feature interactions and expediting the model development process in recommender systems. On large-scale benchmarks, searching for the optimal feature interaction design requires extensive cost due to the sequential workflow on the large volume of data. In addition, fusing interactions of various sources, orders, and mathemati… ▽ More Search efficiency and serving efficiency are two major axes in building feature interactions and expediting the model development process in recommender systems. On large-scale benchmarks, searching for the optimal feature interaction design requires extensive cost due to the sequential workflow on the large volume of data. In addition, fusing interactions of various sources, orders, and mathematical operations introduces potential conflicts and additional redundancy toward recommender models, leading to sub-optimal trade-offs in performance and serving cost. In this paper, we present DistDNAS as a neat solution to brew swift and efficient feature interaction design. DistDNAS proposes a supernet to incorporate interaction modules of varying orders and types as a search space. To optimize search efficiency, DistDNAS distributes the search and aggregates the choice of optimal interaction modules on varying data dates, achieving over 25x speed-up and reducing search cost from 2 days to 2 hours. To optimize serving efficiency, DistDNAS introduces a differentiable cost-aware loss to penalize the selection of redundant interaction modules, enhancing the efficiency of discovered feature interactions in serving. We extensively evaluate the best models crafted by DistDNAS on a 1TB Criteo Terabyte dataset. Experimental evaluations demonstrate 0.001 AUC improvement and 60% FLOPs saving over current state-of-the-art CTR models. △ Less

Submitted 31 October, 2023; originally announced November 2023.

arXiv:2301.10999 [pdf, other]

PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices

Authors: Yuji Chai, Devashree Tripathy, Chuteng Zhou, Dibakar Gope, Igor Fedorov, Ramon Matas, David Brooks, Gu-Yeon Wei, Paul Whatmough

Abstract: The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortuna… ▽ More The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortunately, these metrics are slow to evaluate using simulators (where available) and typically require measurement on the target hardware. This work describes PerfSAGE, a novel graph neural network (GNN) that predicts inference latency, energy, and memory footprint on an arbitrary DNN TFlite graph (TFL, 2017). In contrast, previously published performance predictors can only predict latency and are restricted to pre-defined construction rules or search spaces. This paper also describes the EdgeDLPerf dataset of 134,912 DNNs randomly sampled from four task search spaces and annotated with inference performance metrics from three edge hardware platforms. Using this dataset, we train PerfSAGE and provide experimental results that demonstrate state-of-the-art prediction accuracy with a Mean Absolute Percentage Error of <5% across all targets and model search spaces. These results: (1) Outperform previous state-of-art GNN-based predictors (Dudziak et al., 2020), (2) Accurately predict performance on accelerators (a shortfall of non-GNN-based predictors (Zhang et al., 2021)), and (3) Demonstrate predictions on arbitrary input graphs without modifications to the feature extractor. △ Less

Submitted 26 January, 2023; originally announced January 2023.

arXiv:2208.08562 [pdf, other]

Restructurable Activation Networks

Authors: Kartikeya Bhardwaj, James Ward, Caleb Tung, Dibakar Gope, Lingchuan Meng, Igor Fedorov, Alex Chalfin, Paul Whatmough, Danny Loh

Abstract: Is it possible to restructure the non-linear activation functions in a deep network to create hardware-efficient models? To address this question, we propose a new paradigm called Restructurable Activation Networks (RANs) that manipulate the amount of non-linearity in models to improve their hardware-awareness and efficiency. First, we propose RAN-explicit (RAN-e) -- a new hardware-aware search sp… ▽ More Is it possible to restructure the non-linear activation functions in a deep network to create hardware-efficient models? To address this question, we propose a new paradigm called Restructurable Activation Networks (RANs) that manipulate the amount of non-linearity in models to improve their hardware-awareness and efficiency. First, we propose RAN-explicit (RAN-e) -- a new hardware-aware search space and a semi-automatic search algorithm -- to replace inefficient blocks with hardware-aware blocks. Next, we propose a training-free model scaling method called RAN-implicit (RAN-i) where we theoretically prove the link between network topology and its expressivity in terms of number of non-linear units. We demonstrate that our networks achieve state-of-the-art results on ImageNet at different scales and for several types of hardware. For example, compared to EfficientNet-Lite-B0, RAN-e achieves a similar accuracy while improving Frames-Per-Second (FPS) by 1.5x on Arm micro-NPUs. On the other hand, RAN-i demonstrates up to 2x reduction in #MACs over ConvNexts with a similar or better accuracy. We also show that RAN-i achieves nearly 40% higher FPS than ConvNext on Arm-based datacenter CPUs. Finally, RAN-i based object detection networks achieve a similar or higher mAP and up to 33% higher FPS on datacenter CPUs compared to ConvNext based models. The code to train and evaluate RANs and the pretrained networks are available at https://github.com/ARM-software/ML-restructurable-activation-networks. △ Less

Submitted 7 September, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

Comments: This work was presented at an Arm AI virtual tech talk. Video is available at https://www.youtube.com/watch?v=EUqFNE28Kq4

arXiv:2202.13826 [pdf, ps, other]

doi 10.21437/Odyssey.2022-1

Magnitude-aware Probabilistic Speaker Embeddings

Authors: Nikita Kuzmin, Igor Fedorov, Alexey Sholokhov

Abstract: Recently, hyperspherical embeddings have established themselves as a dominant technique for face and voice recognition. Specifically, Euclidean space vector embeddings are learned to encode person-specific information in their direction while ignoring the magnitude. However, recent studies have shown that the magnitudes of the embeddings extracted by deep neural networks may indicate the quality o… ▽ More Recently, hyperspherical embeddings have established themselves as a dominant technique for face and voice recognition. Specifically, Euclidean space vector embeddings are learned to encode person-specific information in their direction while ignoring the magnitude. However, recent studies have shown that the magnitudes of the embeddings extracted by deep neural networks may indicate the quality of the corresponding inputs. This paper explores the properties of the magnitudes of the embeddings related to quality assessment and out-of-distribution detection. We propose a new probabilistic speaker embedding extractor using the information encoded in the embedding magnitude and leverage it in the speaker verification pipeline. We also propose several quality-aware diarization methods and incorporate the magnitudes in those. Our results indicate significant improvements over magnitude-agnostic baselines both in speaker verification and diarization tasks. △ Less

Submitted 23 October, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

Comments: Accepted to Odyssey 2022: The Speaker and Language Recognition Workshop, camera-ready version

arXiv:2201.05842 [pdf, other]

UDC: Unified DNAS for Compressible TinyML Models

Authors: Igor Fedorov, Ramon Matas, Hokchhay Tann, Chuteng Zhou, Matthew Mattina, Paul Whatmough

Abstract: Deploying TinyML models on low-cost IoT hardware is very challenging, due to limited device memory capacity. Neural processing unit (NPU) hardware address the memory challenge by using model compression to exploit weight quantization and sparsity to fit more parameters in the same footprint. However, designing compressible neural networks (NNs) is challenging, as it expands the design space across… ▽ More Deploying TinyML models on low-cost IoT hardware is very challenging, due to limited device memory capacity. Neural processing unit (NPU) hardware address the memory challenge by using model compression to exploit weight quantization and sparsity to fit more parameters in the same footprint. However, designing compressible neural networks (NNs) is challenging, as it expands the design space across which we must make balanced trade-offs. This paper demonstrates Unified DNAS for Compressible (UDC) NNs, which explores a large search space to generate state-of-the-art compressible NNs for NPU. ImageNet results show UDC networks are up to $3.35\times$ smaller (iso-accuracy) or 6.25% more accurate (iso-model size) than previous work. △ Less

Submitted 5 January, 2023; v1 submitted 15 January, 2022; originally announced January 2022.

arXiv:2010.11267 [pdf, other]

MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers

Authors: Colby Banbury, Chuteng Zhou, Igor Fedorov, Ramon Matas Navarro, Urmish Thakker, Dibakar Gope, Vijay Janapa Reddi, Matthew Mattina, Paul N. Whatmough

Abstract: Executing machine learning workloads locally on resource constrained microcontrollers (MCUs) promises to drastically expand the application space of IoT. However, so-called TinyML presents severe technical challenges, as deep neural network inference demands a large compute and memory budget. To address this challenge, neural architecture search (NAS) promises to help design accurate ML models tha… ▽ More Executing machine learning workloads locally on resource constrained microcontrollers (MCUs) promises to drastically expand the application space of IoT. However, so-called TinyML presents severe technical challenges, as deep neural network inference demands a large compute and memory budget. To address this challenge, neural architecture search (NAS) promises to help design accurate ML models that meet the tight MCU memory, latency and energy constraints. A key component of NAS algorithms is their latency/energy model, i.e., the mapping from a given neural network architecture to its inference latency/energy on an MCU. In this paper, we observe an intriguing property of NAS search spaces for MCU model design: on average, model latency varies linearly with model operation (op) count under a uniform prior over models in the search space. Exploiting this insight, we employ differentiable NAS (DNAS) to search for models with low memory usage and low op count, where op count is treated as a viable proxy to latency. Experimental results validate our methodology, yielding our MicroNet models, which we deploy on MCUs using Tensorflow Lite Micro, a standard open-source NN inference runtime widely used in the TinyML community. MicroNets demonstrate state-of-the-art results for all three TinyMLperf industry-standard benchmark tasks: visual wake words, audio keyword spotting, and anomaly detection. Models and training scripts can be found at github.com/ARM-software/ML-zoo. △ Less

Submitted 12 April, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

Comments: 10 pages, 8 figures, 3 tables

arXiv:2005.11394 [pdf, other]

MANGO: A Python Library for Parallel Hyperparameter Tuning

Authors: Sandeep Singh Sandha, Mohit Aggarwal, Igor Fedorov, Mani Srivastava

Abstract: Tuning hyperparameters for machine learning algorithms is a tedious task, one that is typically done manually. To enable automated hyperparameter tuning, recent works have started to use techniques based on Bayesian optimization. However, to practically enable automated tuning for large scale machine learning training pipelines, significant gaps remain in existing libraries, including lack of abst… ▽ More Tuning hyperparameters for machine learning algorithms is a tedious task, one that is typically done manually. To enable automated hyperparameter tuning, recent works have started to use techniques based on Bayesian optimization. However, to practically enable automated tuning for large scale machine learning training pipelines, significant gaps remain in existing libraries, including lack of abstractions, fault tolerance, and flexibility to support scheduling on any distributed computing framework. To address these challenges, we present Mango, a Python library for parallel hyperparameter tuning. Mango enables the use of any distributed scheduling framework, implements intelligent parallel search strategies, and provides rich abstractions for defining complex hyperparameter search spaces that are compatible with scikit-learn. Mango is comparable in performance to Hyperopt, another widely used library. Mango is available open-source and is currently used in production at Arm Research to provide state-of-art hyperparameter tuning capabilities. △ Less

Submitted 22 May, 2020; originally announced May 2020.

Comments: 5 pages, 3 figures, ICASSP Conference

arXiv:2005.11138 [pdf, other]

doi 10.21437/Interspeech.2020-1864

TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids

Authors: Igor Fedorov, Marko Stamenovic, Carl Jensen, Li-Chia Yang, Ari Mandell, Yiming Gan, Matthew Mattina, Paul N. Whatmough

Abstract: Modern speech enhancement algorithms achieve remarkable noise suppression by means of large recurrent neural networks (RNNs). However, large RNNs limit practical deployment in hearing aid hardware (HW) form-factors, which are battery powered and run on resource-constrained microcontroller units (MCUs) with limited memory capacity and compute capability. In this work, we use model compression techn… ▽ More Modern speech enhancement algorithms achieve remarkable noise suppression by means of large recurrent neural networks (RNNs). However, large RNNs limit practical deployment in hearing aid hardware (HW) form-factors, which are battery powered and run on resource-constrained microcontroller units (MCUs) with limited memory capacity and compute capability. In this work, we use model compression techniques to bridge this gap. We define the constraints imposed on the RNN by the HW and describe a method to satisfy them. Although model compression techniques are an active area of research, we are the first to demonstrate their efficacy for RNN speech enhancement, using pruning and integer quantization of weights/activations. We also demonstrate state update skipping, which reduces the computational load. Finally, we conduct a perceptual evaluation of the compressed models to verify audio quality on human raters. Results show a reduction in model size and operations of 11.9$\times$ and 2.9$\times$, respectively, over the baseline for compressed models, without a statistical difference in listening preference and only exhibiting a loss of 0.55dB SDR. Our model achieves a computational latency of 2.39ms, well within the 10ms target and 351$\times$ better than previous work. △ Less

Submitted 20 May, 2020; originally announced May 2020.

Comments: First four authors contributed equally. For audio samples, see https://github.com/BoseCorp/efficient-neural-speech-enhancement

arXiv:1910.02558 [pdf, other]

Pushing the limits of RNN Compression

Authors: Urmish Thakker, Igor Fedorov, Jesse Beu, Dibakar Gope, Chu Zhou, Ganesh Dasika, Matthew Mattina

Abstract: Recurrent Neural Networks (RNN) can be difficult to deploy on resource constrained devices due to their size. As a result, there is a need for compression techniques that can significantly compress RNNs without negatively impacting task accuracy. This paper introduces a method to compress RNNs for resource constrained environments using Kronecker product (KP). KPs can compress RNN layers by 16-38x… ▽ More Recurrent Neural Networks (RNN) can be difficult to deploy on resource constrained devices due to their size. As a result, there is a need for compression techniques that can significantly compress RNNs without negatively impacting task accuracy. This paper introduces a method to compress RNNs for resource constrained environments using Kronecker product (KP). KPs can compress RNN layers by 16-38x with minimal accuracy loss. We show that KP can beat the task accuracy achieved by other state-of-the-art compression techniques (pruning and low-rank matrix factorization) across 4 benchmarks spanning 3 different applications, while simultaneously improving inference run-time. △ Less

Submitted 9 October, 2019; v1 submitted 4 October, 2019; originally announced October 2019.

Comments: 6 pages. arXiv admin note: substantial text overlap with arXiv:1906.02876

Journal ref: 5th edition of Workshop on Energy Efficient Machine Learning and Cognitive Computing at NeurIPS 2019

arXiv:1906.02876 [pdf, other]

Compressing RNNs for IoT devices by 15-38x using Kronecker Products

Authors: Urmish Thakker, Jesse Beu, Dibakar Gope, Chu Zhou, Igor Fedorov, Ganesh Dasika, Matthew Mattina

Abstract: Recurrent Neural Networks (RNN) can be difficult to deploy on resource constrained devices due to their size.As a result, there is a need for compression techniques that can significantly compress RNNs without negatively impacting task accuracy. This paper introduces a method to compress RNNs for resource constrained environments using Kronecker product (KP). KPs can compress RNN layers by 15-38x… ▽ More Recurrent Neural Networks (RNN) can be difficult to deploy on resource constrained devices due to their size.As a result, there is a need for compression techniques that can significantly compress RNNs without negatively impacting task accuracy. This paper introduces a method to compress RNNs for resource constrained environments using Kronecker product (KP). KPs can compress RNN layers by 15-38x with minimal accuracy loss. By quantizing the resulting models to 8-bits, we further push the compression factor to 50x. We show that KP can beat the task accuracy achieved by other state-of-the-art compression techniques across 5 benchmarks spanning 3 different applications, while simultaneously improving inference run-time. We show that the KP compression mechanism does introduce an accuracy loss, which can be mitigated by a proposed hybrid KP (HKP) approach. Our HKP algorithm provides fine-grained control over the compression ratio, enabling us to regain accuracy lost during compression by adding a small number of model parameters. △ Less

Submitted 31 January, 2020; v1 submitted 6 June, 2019; originally announced June 2019.

arXiv:1905.12107 [pdf, ps, other]

SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers

Authors: Igor Fedorov, Ryan P. Adams, Matthew Mattina, Paul N. Whatmough

Abstract: The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these every-day objects via tiny, cheap MCUs. However, these resource-impoverished hardware pl… ▽ More The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these every-day objects via tiny, cheap MCUs. However, these resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe finite memory limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we find are more accurate and up to $4.35\times$ smaller than previous approaches, while meeting the strict MCU working memory constraint. △ Less

Submitted 28 May, 2019; originally announced May 2019.

arXiv:1804.03740 [pdf, other]

Multimodal Sparse Bayesian Dictionary Learning

Authors: Igor Fedorov, Bhaskar D. Rao

Abstract: This paper addresses the problem of learning dictionaries for multimodal datasets, i.e. datasets collected from multiple data sources. We present an algorithm called multimodal sparse Bayesian dictionary learning (MSBDL). MSBDL leverages information from all available data modalities through a joint sparsity constraint. The underlying framework offers a considerable amount of flexibility to practi… ▽ More This paper addresses the problem of learning dictionaries for multimodal datasets, i.e. datasets collected from multiple data sources. We present an algorithm called multimodal sparse Bayesian dictionary learning (MSBDL). MSBDL leverages information from all available data modalities through a joint sparsity constraint. The underlying framework offers a considerable amount of flexibility to practitioners and addresses many of the shortcomings of existing multimodal dictionary learning approaches. In particular, the procedure includes the automatic tuning of hyperparameters and is unique in that it allows the dictionaries for each data modality to have different cardinality, a significant feature in cases when the dimensionality of data differs across modalities. MSBDL is scalable and can be used in supervised learning settings. Theoretical results relating to the convergence of MSBDL are presented and the numerical results provide evidence of the superior performance of MSBDL on synthetic and real datasets compared to existing methods. △ Less

Submitted 28 May, 2019; v1 submitted 10 April, 2018; originally announced April 2018.

arXiv:1802.01616 [pdf, ps, other]

Re-Weighted Learning for Sparsifying Deep Neural Networks

Authors: Igor Fedorov, Bhaskar D. Rao

Abstract: This paper addresses the topic of sparsifying deep neural networks (DNN's). While DNN's are powerful models that achieve state-of-the-art performance on a large number of tasks, the large number of model parameters poses serious storage and computational challenges. To combat these difficulties, a growing line of work focuses on pruning network weights without sacrificing performance. We propose a… ▽ More This paper addresses the topic of sparsifying deep neural networks (DNN's). While DNN's are powerful models that achieve state-of-the-art performance on a large number of tasks, the large number of model parameters poses serious storage and computational challenges. To combat these difficulties, a growing line of work focuses on pruning network weights without sacrificing performance. We propose a general affine scaling transformation (AST) algorithm to sparsify DNN's. Our approach follows in the footsteps of popular sparse recovery techniques, which have yet to be explored in the context of DNN's. We describe a principled framework for transforming densely connected DNN's into sparsely connected ones without sacrificing network performance. Unlike existing methods, our approach is able to learn sparse connections at each layer simultaneously, and achieves comparable pruning results on the architecture tested. △ Less

Submitted 5 February, 2018; originally announced February 2018.

arXiv:1703.10645 [pdf, other]

Relevance Subject Machine: A Novel Person Re-identification Framework

Authors: Igor Fedorov, Ritwik Giri, Bhaskar D. Rao, Truong Q. Nguyen

Abstract: We propose a novel method called the Relevance Subject Machine (RSM) to solve the person re-identification (re-id) problem. RSM falls under the category of Bayesian sparse recovery algorithms and uses the sparse representation of the input video under a pre-defined dictionary to identify the subject in the video. Our approach focuses on the multi-shot re-id problem, which is the prevalent problem… ▽ More We propose a novel method called the Relevance Subject Machine (RSM) to solve the person re-identification (re-id) problem. RSM falls under the category of Bayesian sparse recovery algorithms and uses the sparse representation of the input video under a pre-defined dictionary to identify the subject in the video. Our approach focuses on the multi-shot re-id problem, which is the prevalent problem in many video analytics applications. RSM captures the essence of the multi-shot re-id problem by constraining the support of the sparse codes for each input video frame to be the same. Our proposed approach is also robust enough to deal with time varying outliers and occlusions by introducing a sparse, non-stationary noise term in the model error. We provide a novel Variational Bayesian based inference procedure along with an intuitive interpretation of the proposed update rules. We evaluate our approach over several commonly used re-id datasets and show superior performance over current state-of-the-art algorithms. Specifically, for ILIDS-VID, a recent large scale re-id dataset, RSM shows significant improvement over all published approaches, achieving an 11.5% (absolute) improvement in rank 1 accuracy over the closest competing algorithm considered. △ Less

Submitted 30 March, 2017; originally announced March 2017.

Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

arXiv:1605.02057 [pdf, other]

Robust Bayesian Method for Simultaneous Block Sparse Signal Recovery with Applications to Face Recognition

Authors: Igor Fedorov, Ritwik Giri, Bhaskar D. Rao, Truong Q. Nguyen

Abstract: In this paper, we present a novel Bayesian approach to recover simultaneously block sparse signals in the presence of outliers. The key advantage of our proposed method is the ability to handle non-stationary outliers, i.e. outliers which have time varying support. We validate our approach with empirical results showing the superiority of the proposed method over competing approaches in synthetic… ▽ More In this paper, we present a novel Bayesian approach to recover simultaneously block sparse signals in the presence of outliers. The key advantage of our proposed method is the ability to handle non-stationary outliers, i.e. outliers which have time varying support. We validate our approach with empirical results showing the superiority of the proposed method over competing approaches in synthetic data experiments as well as the multiple measurement face recognition problem. △ Less

Submitted 10 May, 2016; v1 submitted 6 May, 2016; originally announced May 2016.

Comments: To appear in ICIP 2016

arXiv:1601.06207 [pdf, other]

Rectified Gaussian Scale Mixtures and the Sparse Non-Negative Least Squares Problem

Authors: Alican Nalci, Igor Fedorov, Maher Al-Shoukairi, Thomas T. Liu, Bhaskar D. Rao

Abstract: In this paper, we develop a Bayesian evidence maximization framework to solve the sparse non-negative least squares (S-NNLS) problem. We introduce a family of probability densities referred to as the Rectified Gaussian Scale Mixture (R- GSM) to model the sparsity enforcing prior distribution for the solution. The R-GSM prior encompasses a variety of heavy-tailed densities such as the rectified Lap… ▽ More In this paper, we develop a Bayesian evidence maximization framework to solve the sparse non-negative least squares (S-NNLS) problem. We introduce a family of probability densities referred to as the Rectified Gaussian Scale Mixture (R- GSM) to model the sparsity enforcing prior distribution for the solution. The R-GSM prior encompasses a variety of heavy-tailed densities such as the rectified Laplacian and rectified Student- t distributions with a proper choice of the mixing density. We utilize the hierarchical representation induced by the R-GSM prior and develop an evidence maximization framework based on the Expectation-Maximization (EM) algorithm. Using the EM based method, we estimate the hyper-parameters and obtain a point estimate for the solution. We refer to the proposed method as rectified sparse Bayesian learning (R-SBL). We provide four R- SBL variants that offer a range of options for computational complexity and the quality of the E-step computation. These methods include the Markov chain Monte Carlo EM, linear minimum mean-square-error estimation, approximate message passing and a diagonal approximation. Using numerical experiments, we show that the proposed R-SBL method outperforms existing S-NNLS solvers in terms of both signal and support recovery performance, and is also very robust against the structure of the design matrix. △ Less

Submitted 27 March, 2018; v1 submitted 22 January, 2016; originally announced January 2016.

Comments: Under Review by IEEE Transactions on Signal Processing

Showing 1–21 of 21 results for author: Fedorov, I