Measuring model variability using robust non-parametric testing

Authors: Sinjini Banerjee, Tim Marrinan, Reilly Cannon, Tony Chiang, Anand D. Sarwate

Abstract: Training a deep neural network often involves stochastic optimization, meaning each run will produce a different model. The seed used to initialize random elements of the optimization procedure heavily influences the quality of a trained model, which may be obscure from many commonly reported summary statistics, like accuracy. However, random seed is often not included in hyper-parameter optimization, perhaps because the relationship between seed and model quality is hard to describe. This work attempts to describe the relationship between deep net models trained with different random seeds and the behavior of the expected model. We adopt robust hypothesis testing to propose a novel summary statistic for network similarity, referred to as the $α$-trimming level. We use the $α$-trimming level to show that the empirical cumulative distribution function of an ensemble model created from a collection of trained models with different random seeds approximates the average of these functions as the number of models in the collection grows large. This insight provides guidance for how many random seeds should be sampled to ensure that an ensemble of these trained models is a reliable representative. We also show that the $α$-trimming level is more expressive than different performance metrics like validation accuracy, churn, or expected calibration error when taken alone and may help with random seed selection in a more principled fashion. We demonstrate the value of the proposed statistic in real experiments and illustrate the advantage of fine-tuning over random seed with an experiment in transfer learning.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2310.00541 [pdf, other]

Robust Nonparametric Hypothesis Testing to Understand Variability in Training Neural Networks

Authors: Sinjini Banerjee, Reilly Cannon, Tim Marrinan, Tony Chiang, Anand D. Sarwate

Abstract: Training a deep neural network (DNN) often involves stochastic optimization, which means each run will produce a different model. Several works suggest this variability is negligible when models have the same performance, which in the case of classification is test accuracy. However, models with similar test accuracy may not be computing the same function. We propose a new measure of closeness between classification models based on the output of the network before thresholding. Our measure is based on a robust hypothesis-testing framework and can be adapted to other quantities derived from trained models.

Submitted 30 September, 2023; originally announced October 2023.

arXiv:2308.02922 [pdf, other]

Structured Low-Rank Tensors for Generalized Linear Models

Authors: Batoul Taki, Anand D. Sarwate, Waheed U. Bajwa

Abstract: Recent works have shown that imposing tensor structures on the coefficient tensor in regression problems can lead to more reliable parameter estimation and lower sample complexity compared to vector-based methods. This work investigates a new low-rank tensor model, called Low Separation Rank (LSR), in Generalized Linear Model (GLM) problems. The LSR model -- which generalizes the well-known Tucker and CANDECOMP/PARAFAC (CP) models, and is a special case of the Block Tensor Decomposition (BTD) model -- is imposed onto the coefficient tensor in the GLM model. This work proposes a block coordinate descent algorithm for parameter estimation in LSR-structured tensor GLMs. Most importantly, it derives a minimax lower bound on the error threshold on estimating the coefficient tensor in LSR tensor GLM problems. The minimax bound is proportional to the intrinsic degrees of freedom in the LSR tensor GLM problem, suggesting that its sample complexity may be significantly lower than that of vectorized GLMs. This result can also be specialised to lower bound the estimation error in CP and Tucker-structured GLMs. The derived bounds are comparable to tight bounds in the literature for Tucker linear regression, and the tightness of the minimax lower bound is further assessed numerically. Finally, numerical experiments on synthetic datasets demonstrate the efficacy of the proposed LSR tensor model for three regression types (linear, logistic and Poisson). Experiments on a collection of medical imaging datasets demonstrate the usefulness of the LSR model over other tensor models (Tucker and CP) on real, imbalanced data with limited available samples.

Submitted 5 August, 2023; originally announced August 2023.

Comments: 43 pages; published in Transactions on Machine Learning Research (08/2023)

Journal ref: Transactions on Machine Learning Research, Aug. 2023 (https://openreview.net/forum?id=qUxBs3Ln41)

arXiv:2307.11684 [pdf, other]

Minibatching Offers Improved Generalization Performance for Second Order Optimizers

Authors: Eric Silk, Swarnita Chakraborty, Nairanjana Dasgupta, Anand D. Sarwate, Andrew Lumsdaine, Tony Chiang

Abstract: Training deep neural networks (DNNs) used in modern machine learning is computationally expensive. Machine learning scientists, therefore, rely on stochastic first-order methods for training, coupled with significant hand-tuning, to obtain good performance. To better understand performance variability of different stochastic algorithms, including second-order methods, we conduct an empirical study that treats performance as a response variable across multiple training sessions of the same model. Using 2-factor Analysis of Variance (ANOVA) with interactions, we show that batch size used during training has a statistically significant effect on the peak accuracy of the methods, and that full batch largely performed the worst. In addition, we found that second-order optimizers (SOOs) generally exhibited significantly lower variance at specific batch sizes, suggesting they may require less hyperparameter tuning, leading to a reduced overall time to solution for model training.

Submitted 25 May, 2023; originally announced July 2023.

Comments: 14 pages, 6 figures, 5 tables

arXiv:2205.12372 [pdf, other]

TorchNTK: A Library for Calculation of Neural Tangent Kernels of PyTorch Models

Authors: Andrew Engel, Zhichao Wang, Anand D. Sarwate, Sutanay Choudhury, Tony Chiang

Abstract: We introduce torchNTK, a python library to calculate the empirical neural tangent kernel (NTK) of neural network models in the PyTorch framework. We provide an efficient method to calculate the NTK of multilayer perceptrons. We compare the explicit differentiation implementation against autodifferentiation implementations, which have the benefit of extending the utility of the library to any architecture supported by PyTorch, such as convolutional networks. A feature of the library is that we expose the user to layerwise NTK components, and show that in some regimes a layerwise calculation is more memory efficient. We conduct preliminary experiments to demonstrate use cases for the software and probe the NTK.

Submitted 24 May, 2022; originally announced May 2022.

Comments: 19 pages, 5 figures

arXiv:2205.06708 [pdf, ps, other]

The Capacity of Causal Adversarial Channels

Authors: Yihan Zhang, Sidharth Jaggi, Michael Langberg, Anand D. Sarwate

Abstract: We characterize the capacity for the discrete-time arbitrarily varying channel with discrete inputs, outputs, and states when (a) the encoder and decoder do not share common randomness, (b) the input and state are subject to cost constraints, (c) the transition matrix of the channel is deterministic given the state, and (d) at each time step the adversary can only observe the current and past channel inputs when choosing the state at that time. The achievable strategy involves stochastic encoding together with list decoding and a disambiguation step. The converse uses a two-phase "babble-and-push" strategy where the adversary chooses the state randomly in the first phase, list decodes the output, and then chooses state inputs to symmetrize the channel in the second phase. These results generalize prior work on specific channels models (additive, erasure) to general discrete alphabets and models.

Submitted 13 May, 2022; originally announced May 2022.

arXiv:2202.08260 [pdf, other]

Low-Rank Phase Retrieval with Structured Tensor Models

Authors: Soo Min Kwon, Xin Li, Anand D. Sarwate

Abstract: We study the low-rank phase retrieval problem, where the objective is to recover a sequence of signals (typically images) given the magnitude of linear measurements of those signals. Existing solutions involve recovering a matrix constructed by vectorizing and stacking each image. These algorithms model this matrix to be low-rank and leverage the low-rank property to decrease the sample complexity required for accurate recovery. However, when the number of available measurements is more limited, these low-rank matrix models can often fail. We propose an algorithm called Tucker-Structured Phase Retrieval (TSPR) that models the sequence of images as a tensor rather than a matrix that we factorize using the Tucker decomposition. This factorization reduces the number of parameters that need to be estimated, allowing for a more accurate reconstruction in the under-sampled regime. Interestingly, we observe that this structure also has improved performance in the over-determined setting when the Tucker ranks are chosen appropriately. We demonstrate the effectiveness of our approach on real video datasets under several different measurement models.

Submitted 15 February, 2022; originally announced February 2022.

Comments: A shorter version of this paper is in 2022 International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

arXiv:2111.14992 [pdf, other]

Network Traffic Shaping for Enhancing Privacy in IoT Systems

Authors: Sijie Xiong, Anand D. Sarwate, Narayan B. Mandayam

Abstract: Motivated by privacy issues caused by inference attacks on user activities in the packet sizes and timing information of Internet of Things (IoT) network traffic, we establish a rigorous event-level differential privacy (DP) model on infinite packet streams. We propose a memoryless traffic shaping mechanism satisfying a first-come-first-served queuing discipline that outputs traffic dependent on the input using a DP mechanism. We show that in special cases the proposed mechanism recovers existing shapers which standardize the output independently from the input. To find the optimal shapers for given levels of privacy and transmission efficiency, we formulate the constrained problem of minimizing the expected delay per packet and propose using the expected queue size across time as a proxy. We further show that the constrained minimization is a convex program. We demonstrate the effect of shapers on both synthetic data and packet traces from actual IoT devices. The experimental results reveal inherent privacy-overhead tradeoffs: more shaping overhead provides better privacy protection. Under the same privacy level, there naturally exists a tradeoff between dummy traffic and delay. When dealing with heavier or less bursty input traffic, all shapers become more overhead-efficient. We also show that increased traffic from a larger number of IoT devices makes guaranteeing event-level privacy easier. The DP shaper offers tunable privacy that is invariant with the change in the input traffic distribution and has an advantage in handling burstiness over traffic-independent shapers. This approach well accommodates heterogeneous network conditions and enables users to adapt to their privacy/overhead demands.

Submitted 29 November, 2021; originally announced November 2021.

Comments: 18 pages, 10 figures, submitted to IEEE Transactions on Networking

arXiv:2105.14673 [pdf, ps, other]

doi 10.1109/IEEECONF53345.2021.9723149

A Minimax Lower Bound for Low-Rank Matrix-Variate Logistic Regression

Authors: Batoul Taki, Mohsen Ghassemi, Anand D. Sarwate, Waheed U. Bajwa

Abstract: This paper considers the problem of matrix-variate logistic regression. It derives the fundamental error threshold on estimating low-rank coefficient matrices in the logistic regression problem by obtaining a lower bound on the minimax risk. The bound depends explicitly on the dimension and distribution of the covariates, the rank and energy of the coefficient matrix, and the number of samples. The resulting bound is proportional to the intrinsic degrees of freedom in the problem, which suggests the sample complexity of the low-rank matrix logistic regression problem can be lower than that for vectorized logistic regression. The proof techniques utilized in this work also set the stage for development of minimax lower bounds for tensor-variate logistic regression problems.

Submitted 28 January, 2022; v1 submitted 30 May, 2021; originally announced May 2021.

Comments: 8 pages; published in Proc. 55th Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, Oct. 31-Nov. 3, 2021

arXiv:2012.11877 [pdf, other]

Influencers and the Giant Component: the Fundamental Hardness in Privacy Protection for Socially Contagious Attributes

Authors: Aria Rezaei, Jie Gao, Anand D. Sarwate

Abstract: The presence of correlation is known to make privacy protection more difficult. We investigate the privacy of socially contagious attributes on a network of individuals, where each individual possessing that attribute may influence a number of others into adopting it. We show that for contagions following the Independent Cascade model there exists a giant connected component of infected nodes, containing a constant fraction of all the nodes who all receive the contagion from the same set of sources. We further show that it is extremely hard to hide the existence of this giant connected component if we want to obtain an estimate of the activated users at an acceptable level. Moreover, an adversary possessing this knowledge can predict the real status ("active" or "inactive") with decent probability for many of the individuals regardless of the privacy (perturbation) mechanism used. As a case study, we show that the Wasserstein mechanism, a state-of-the-art privacy mechanism designed specifically for correlated data, introduces a noise with magnitude of order $Ω(n)$ in the count estimation in our setting. We provide theoretical guarantees for two classes of random networks: Erdos Renyi graphs and Chung-Lu power-law graphs under the Independent Cascade model. Experiments demonstrate that a giant connected component of infected nodes can and does appear in real-world networks and that a simple inference attack can reveal the status of a good fraction of nodes.

Submitted 22 December, 2020; originally announced December 2020.

Comments: SIAM SDM 2021, privacy, social contagions, social networks

arXiv:2006.06792 [pdf, other]

Quantile Multi-Armed Bandits: Optimal Best-Arm Identification and a Differentially Private Scheme

Authors: Konstantinos E. Nikolakakis, Dionysios S. Kalogerias, Or Sheffet, Anand D. Sarwate

Abstract: We study the best-arm identification problem in multi-armed bandits with stochastic, potentially private rewards, when the goal is to identify the arm with the highest quantile at a fixed, prescribed level. First, we propose a (non-private) successive elimination algorithm for strictly optimal best-arm identification, we show that our algorithm is $δ$-PAC and we characterize its sample complexity. Further, we provide a lower bound on the expected number of pulls, showing that the proposed algorithm is essentially optimal up to logarithmic factors. Both upper and lower complexity bounds depend on a special definition of the associated suboptimality gap, designed in particular for the quantile bandit problem, as we show when the gap approaches zero, best-arm identification is impossible. Second, motivated by applications where the rewards are private, we provide a differentially private successive elimination algorithm whose sample complexity is finite even for distributions with infinite support-size, and we characterize its sample complexity. Our algorithms do not require prior knowledge of either the suboptimality gap or other statistical information related to the bandit problem at hand.

Submitted 4 December, 2022; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: 18 pages, 4 figures

arXiv:1910.12913 [pdf, other]

Improved Differentially Private Decentralized Source Separation for fMRI Data

Authors: Hafiz Imtiaz, Jafar Mohammadi, Rogers Silva, Bradley Baker, Sergey M. Plis, Anand D. Sarwate, Vince Calhoun

Abstract: Blind source separation algorithms such as independent component analysis (ICA) are widely used in the analysis of neuroimaging data. In order to leverage larger sample sizes, different data holders/sites may wish to collaboratively learn feature representations. However, such datasets are often privacy-sensitive, precluding centralized analyses that pool the data at a single site. In this work, we propose a differentially private algorithm for performing ICA in a decentralized data setting. Conventional approaches to decentralized differentially private algorithms may introduce too much noise due to the typically small sample sizes at each site. We propose a novel protocol that uses correlated noise to remedy this problem. We show that our algorithm outperforms existing approaches on synthetic and real neuroimaging datasets and demonstrate that it can sometimes reach the same level of utility as the corresponding non-private algorithm. This indicates that it is possible to have meaningful utility while preserving privacy.

Submitted 22 February, 2021; v1 submitted 28 October, 2019; originally announced October 2019.

Comments: © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. arXiv admin note: text overlap with arXiv:1904.10059

arXiv:1909.09596 [pdf, other]

Optimal Rates for Learning Hidden Tree Structures

Authors: Konstantinos E. Nikolakakis, Dionysios S. Kalogerias, Anand D. Sarwate

Abstract: We provide high probability finite sample complexity guarantees for hidden non-parametric structure learning of tree-shaped graphical models, whose hidden and observable nodes are discrete random variables with either finite or countable alphabets. We study a fundamental quantity called the (noisy) information threshold, which arises naturally from the error analysis of the Chow-Liu algorithm and, as we discuss, provides explicit necessary and sufficient conditions on sample complexity, by effectively summarizing the difficulty of the tree-structure learning problem. Specifically, we show that the finite sample complexity of the Chow-Liu algorithm for ensuring exact structure recovery from noisy data is inversely proportional to the information threshold squared (provided it is positive), and scales almost logarithmically relative to the number of nodes over a given probability of failure. Conversely, we show that, if the number of samples is less than an absolute constant times the inverse of information threshold squared, then no algorithm can recover the hidden tree structure with probability greater than one half. As a consequence, our upper and lower bounds match with respect to the information threshold, indicating that it is a fundamental quantity for the problem of learning hidden tree-structured models. Further, the Chow-Liu algorithm with noisy data as input achieves the optimal rate with respect to the information threshold. Lastly, as a byproduct of our analysis, we resolve the problem of tree structure learning in the presence of non-identically distributed observation noise, providing conditions for convergence of the Chow-Liu algorithm under this setting, as well.

Submitted 31 March, 2021; v1 submitted 20 September, 2019; originally announced September 2019.

Comments: 33 pages, 4 figures
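
For readers unfamiliar with the Chow-Liu algorithm named in the abstract above, the following is a minimal sketch of its classical form: estimate pairwise mutual information from samples, then take a maximum-weight spanning tree. It is not the paper's code; the function names, the Kruskal-style tree construction, and the toy three-node chain are assumptions made for illustration, and the noisy hidden-variable setting studied in the paper is not modeled here.

```python
import math
from collections import Counter
from itertools import combinations


def empirical_mutual_information(samples, i, j):
    """Plug-in estimate (in nats) of I(X_i; X_j) from a list of tuples."""
    n = len(samples)
    joint = Counter((s[i], s[j]) for s in samples)
    marg_i = Counter(s[i] for s in samples)
    marg_j = Counter(s[j] for s in samples)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log(p(a,b) / (p(a) p(b))), written in terms of counts
        mi += (count / n) * math.log(count * n / (marg_i[a] * marg_j[b]))
    return mi


def chow_liu_tree(samples):
    """Return the edges of a maximum-weight spanning tree over the variables,
    with empirical mutual information as the edge weight (Kruskal + union-find)."""
    p = len(samples[0])
    weighted_pairs = sorted(
        ((empirical_mutual_information(samples, i, j), i, j)
         for i, j in combinations(range(p), 2)),
        reverse=True,
    )
    parent = list(range(p))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    edges = set()
    for _, i, j in weighted_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding (i, j) does not close a cycle
            parent[root_i] = root_j
            edges.add((i, j))
    return edges


# Hypothetical toy data: a chain X0 - X1 - X2 in which each link copies its
# neighbor with frequency 0.9, built with exact counts (200 samples in total).
samples = []
for x0 in (0, 1):
    for x1 in (0, 1):
        for x2 in (0, 1):
            count = (9 if x1 == x0 else 1) * (9 if x2 == x1 else 1)
            samples.extend([(x0, x1, x2)] * count)

tree = chow_liu_tree(samples)
```

On this toy dataset the adjacent pairs carry more empirical mutual information than the pair (0, 2), so the sketch returns the chain's two edges.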

arXiv:1908.08407 [pdf, other]

doi 10.1109/TIT.2021.3091604

Coordination Through Shared Randomness

Authors: Gowtham R. Kurri, Vinod M. Prabhakaran, Anand D. Sarwate

Abstract: We study a distributed sampling problem where a set of processors want to output (approximately) independent and identically distributed samples from a joint distribution with the help of a common message from a coordinator. Each processor has access to a subset of sources from a set of independent sources of "shared" randomness. We consider two cases -- in the "omniscient coordinator setting", the coordinator has access to all these sources of shared randomness, while in the "oblivious coordinator setting", it has access to none. All processors and the coordinator may privately randomize. In the omniscient coordinator setting, when the subsets at the processors are disjoint (individually shared randomness model), we characterize the rate of communication required from the coordinator to the processors over a multicast link. For the two-processor case, the optimal rate matches a special case of relaxed Wyner's common information proposed by Gastpar and Sula (2019), thereby providing an operational meaning to the latter. We also give an upper bound on the communication rate for the "randomness-on-the-forehead" model where each processor observes all but one source of randomness and we give an achievable strategy for the general case where the processors have access to arbitrary subsets of sources of randomness. Also, we consider a more general model where the processors observe components of correlated sources (with the coordinator observing all the components), where we characterize the communication rate when all the processors wish to output the same random sequence. In the oblivious coordinator setting, we completely characterize the trade-off region between the communication and shared randomness rates for the general case where the processors have access to arbitrary subsets of sources of randomness.

Submitted 17 June, 2021; v1 submitted 22 August, 2019; originally announced August 2019.

Comments: 27 pages, 7 figures. Some results of this paper were presented at ISIT 2018 and ITW 2019. This paper subsumes arXiv:1805.03193

arXiv:1904.10059 [pdf, other]

Distributed Differentially Private Computation of Functions with Correlated Noise

Authors: Hafiz Imtiaz, Jafar Mohammadi, Anand D. Sarwate

Abstract: Many applications of machine learning, such as human health research, involve processing private or sensitive information. Privacy concerns may impose significant hurdles to collaboration in scenarios where there are multiple sites holding data and the goal is to estimate properties jointly across all datasets. Differentially private decentralized algorithms can provide strong privacy guarantees. However, the accuracy of the joint estimates may be poor when the datasets at each site are small. This paper proposes a new framework, Correlation Assisted Private Estimation (CAPE), for designing privacy-preserving decentralized algorithms with better accuracy guarantees in an honest-but-curious model. CAPE can be used in conjunction with the functional mechanism for statistical and machine learning optimization problems. A tighter characterization of the functional mechanism is provided that allows CAPE to achieve the same performance as a centralized algorithm in the decentralized setting using all datasets. Empirical results on regression and neural network problems for both synthetic and real datasets show that differentially private methods can be competitive with non-private algorithms in many scenarios of interest.

Submitted 22 February, 2021; v1 submitted 22 April, 2019; originally announced April 2019.

Comments: The manuscript is partially subsumed by arXiv:1910.12913
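
The abstract above does not spell out how the correlated noise is built. One generic possibility, assumed here purely for illustration and not taken from the paper, is noise that sums to zero across sites: each site's released value is perturbed, yet the perturbations cancel when the values are averaged. The helper name `zero_sum_noise` and the per-site numbers are hypothetical, and this sketch alone gives no differential privacy guarantee (a real protocol would also need independent noise and a secure way to generate the correlated terms).

```python
import random


def zero_sum_noise(num_sites, scale, rng):
    """Draw one Gaussian term per site, then subtract their mean so the
    terms sum to zero across sites (illustrative construction only)."""
    raw = [rng.gauss(0.0, scale) for _ in range(num_sites)]
    mean = sum(raw) / num_sites
    return [r - mean for r in raw]


rng = random.Random(0)
local_values = [0.2, 0.5, 0.3, 0.4]  # hypothetical per-site statistics

correlated = zero_sum_noise(len(local_values), scale=1.0, rng=rng)
shares = [v + e for v, e in zip(local_values, correlated)]  # what each site would send

true_mean = sum(local_values) / len(local_values)
noisy_mean = sum(shares) / len(shares)  # the zero-sum terms cancel in the average
```

Each individual share differs from its site's true value, while `noisy_mean` agrees with `true_mean` up to floating-point error.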

arXiv:1903.09284 [pdf, other]

doi 10.1109/TSP.2019.2952046

Learning Mixtures of Separable Dictionaries for Tensor Data: Analysis and Algorithms

Authors: Mohsen Ghassemi, Zahra Shakeri, Anand D. Sarwate, Waheed U. Bajwa

Abstract: This work addresses the problem of learning sparse representations of tensor data using structured dictionary learning. It proposes learning a mixture of separable dictionaries to better capture the structure of tensor data by generalizing the separable dictionary learning model. Two different approaches for learning mixture of separable dictionaries are explored and sufficient conditions for local identifiability of the underlying dictionary are derived in each case. Moreover, computational algorithms are developed to solve the problem of learning mixture of separable dictionaries in both batch and online settings. Numerical experiments are used to show the usefulness of the proposed model and the efficacy of the developed algorithms.

Submitted 13 June, 2020; v1 submitted 21 March, 2019; originally announced March 2019.

Comments: 18 pages, 4 figures, 3 tables; Published in IEEE Trans. Signal Processing

Journal ref: IEEE Trans. Signal Processing, vol. 68, pp. 33-48, 2020

arXiv:1812.04700 [pdf, other]

Predictive Learning on Hidden Tree-Structured Ising Models

Authors: Konstantinos E. Nikolakakis, Dionysios S. Kalogerias, Anand D. Sarwate

Abstract: We provide high-probability sample complexity guarantees for exact structure recovery and accurate predictive learning using noise-corrupted samples from an acyclic (tree-shaped) graphical model. The hidden variables follow a tree-structured Ising model distribution, whereas the observable variables are generated by a binary symmetric channel taking the hidden variables as its input (flipping each bit independently with some constant probability $q\in [0,1/2)$). In the absence of noise, predictive learning on Ising models was recently studied by Bresler and Karzand (2020); this paper quantifies how noise in the hidden model impacts the tasks of structure recovery and marginal distribution estimation by proving upper and lower bounds on the sample complexity. Our results generalize state-of-the-art bounds reported in prior work, and they exactly recover the noiseless case ($q=0$). In fact, for any tree with $p$ vertices and probability of incorrect recovery $δ>0$, the sufficient number of samples remains logarithmic as in the noiseless case, i.e., $\mathcal{O}(\log(p/δ))$, while the dependence on $q$ is $\mathcal{O}\big( 1/(1-2q)^{4} \big)$, for both aforementioned tasks. We also present a new equivalent of Isserlis' Theorem for sign-valued tree-structured distributions, yielding a new low-complexity algorithm for higher-order moment estimation.

Submitted 16 February, 2021; v1 submitted 11 December, 2018; originally announced December 2018.

Comments: 82 pages, 8 figures

arXiv:1805.03319 [pdf, other]

Quadratically Constrained Channels with Causal Adversaries

Authors: Tongxin Li, Bikash Kumar Dey, Sidharth Jaggi, Michael Langberg, Anand D. Sarwate

Abstract: We consider the problem of communication over a channel with a causal jamming adversary subject to quadratic constraints. A sender Alice wishes to communicate a message to a receiver Bob by transmitting a real-valued length-$n$ codeword $\mathbf{x}=x_1,\ldots,x_n$ through a communication channel. Alice and Bob do not share common randomness. Knowing Alice's encoding strategy, an adversarial jammer James chooses a real-valued length-$n$ noise sequence $\mathbf{s}=s_1,\ldots,s_n$ in a causal manner, i.e., each $s_t$ ($1 \le t \le n$) can only depend on $x_1,\ldots,x_t$. Bob receives $\mathbf{y}$, the sum of Alice's transmission $\mathbf{x}$ and James' jamming vector $\mathbf{s}$, and is required to reliably estimate Alice's message from this sum. In addition, Alice and James's transmission powers are restricted by quadratic constraints $P>0$ and $N>0$. In this work, we characterize the channel capacity for such a channel as the limit superior of the optimal values of a series of optimizations. Upper and lower bounds on the optimal values are provided both analytically and numerically. Interestingly, unlike many communication problems, in this causal setting Alice's optimal codebook may not have a uniform power allocation - for certain SNR, a codebook with a two-level uniform power allocation results in a strictly higher rate than a codebook with a uniform power allocation would.

Submitted 8 May, 2018; originally announced May 2018.

Comments: 80 pages, ISIT 2018

arXiv:1805.03193 [pdf, other]

Coordination Using Individually Shared Randomness

Authors: Gowtham R. Kurri, Vinod M. Prabhakaran, Anand D. Sarwate

Abstract: Two processors output correlated sequences using the help of a coordinator with whom they individually share independent randomness. For the case of unlimited shared randomness, we characterize the rate of communication required from the coordinator to the processors over a broadcast link. We also give an achievable trade-off between the communication and shared randomness rates.

Submitted 8 May, 2018; originally announced May 2018.

Comments: Extended version of a paper accepted for presentation at ISIT 2018. 8 pages, 3 figures

arXiv:1804.10299 [pdf, other]

doi 10.1109/JSTSP.2018.2877842

Distributed Differentially-Private Algorithms for Matrix and Tensor Factorization

Authors: Hafiz Imtiaz, Anand D. Sarwate

Abstract: In many signal processing and machine learning applications, datasets containing private information are held at different locations, requiring the development of distributed privacy-preserving algorithms. Tensor and matrix factorizations are key components of many processing pipelines. In the distributed setting, differentially private algorithms suffer because they introduce noise to guarantee privacy. This paper designs new and improved distributed and differentially private algorithms for two popular matrix and tensor factorization methods: principal component analysis (PCA) and orthogonal tensor decomposition (OTD). The new algorithms employ a correlated noise design scheme to alleviate the effects of noise and can achieve the same noise level as the centralized scenario. Experiments on synthetic and real data illustrate the regimes in which the correlated noise allows performance matching with the centralized setting, outperforming previous methods and demonstrating that meaningful utility is possible while guaranteeing differential privacy.

Submitted 26 April, 2018; originally announced April 2018.

Comments: 39 pages, in review for publication

Journal ref: IEEE Journal of Selected Topics in Signal Processing 2018

arXiv:1712.03471 [pdf, other]

doi 10.1109/JSTSP.2018.2838092

Identifiability of Kronecker-structured Dictionaries for Tensor Data

Authors: Zahra Shakeri, Anand D. Sarwate, Waheed U. Bajwa

Abstract: This paper derives sufficient conditions for local recovery of coordinate dictionaries comprising a Kronecker-structured dictionary that is used for representing $K$th-order tensor data. Tensor observations are assumed to be generated from a Kronecker-structured dictionary multiplied by sparse coefficient tensors that follow the separable sparsity model. This work provides sufficient conditions on the underlying coordinate dictionaries, coefficient and noise distributions, and number of samples that guarantee recovery of the individual coordinate dictionaries up to a specified error, as a local minimum of the objective function, with high probability. In particular, the sample complexity to recover $K$ coordinate dictionaries with dimensions $m_k \times p_k$ up to estimation error $\varepsilon_k$ is shown to be $\max_{k \in [K]}\mathcal{O}(m_kp_k^3\varepsilon_k^{-2})$.

Submitted 25 May, 2018; v1 submitted 10 December, 2017; originally announced December 2017.

Comments: 16 pages, to appear in IEEE Journal of Selected Topics in Signal Processing

Journal ref: IEEE J. Sel. Topics Signal Processing, vol. 12, no. 5, pp. 1047-1062, Oct. 2018

arXiv:1711.04887 [pdf, other]

STARK: Structured Dictionary Learning Through Rank-one Tensor Recovery

Authors: Mohsen Ghassemi, Zahra Shakeri, Anand D. Sarwate, Waheed U. Bajwa

Abstract: In recent years, a class of dictionaries has been proposed for multidimensional (tensor) data representation that exploits the structure of tensor data by imposing a Kronecker structure on the dictionary underlying the data. In this work, a novel algorithm called "STARK" is provided to learn Kronecker structured dictionaries that can represent tensors of any order. By establishing that the Kronecker product of any number of matrices can be rearranged to form a rank-1 tensor, we show that Kronecker structure can be enforced on the dictionary by solving a rank-1 tensor recovery problem. Because rank-1 tensor recovery is a challenging nonconvex problem, we resort to solving a convex relaxation of this problem. Empirical experiments on synthetic and real data show promising results for our proposed algorithm.

Submitted 13 November, 2017; originally announced November 2017.

arXiv:1705.09905 [pdf, other]

doi 10.1109/CLUSTER.2017.75

A Unified Optimization Approach for Sparse Tensor Operations on GPUs

Authors: Bangtian Liu, Chengyao Wen, Anand D. Sarwate, Maryam Mehri Dehnavi

Abstract: Sparse tensors appear in many large-scale applications with multidimensional and sparse data. While multidimensional sparse data often need to be processed on manycore processors, attempts to develop highly-optimized GPU-based implementations of sparse tensor operations are rare. The irregular computation patterns and sparsity structures as well as the large memory footprints of sparse tensor operations make such implementations challenging. We leverage the fact that sparse tensor operations share similar computation patterns to propose a unified tensor representation called F-COO. Combined with GPU-specific optimizations, F-COO provides highly-optimized implementations of sparse tensor computations on GPUs. The performance of the proposed unified approach is demonstrated for tensor-based kernels such as the Sparse Matricized Tensor-Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor-Times-Matrix Multiply (SpTTM) and is used in tensor decomposition algorithms. Compared to state-of-the-art work we improve the performance of SpTTM and SpMTTKRP up to 3.7 and 30.6 times respectively on NVIDIA Titan-X GPUs. We implement a CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs.

Submitted 28 May, 2017; originally announced May 2017.

arXiv:1608.02792 [pdf, other]

doi 10.1109/TIT.2018.2799931

Minimax Lower Bounds on Dictionary Learning for Tensor Data

Authors: Zahra Shakeri, Waheed U. Bajwa, Anand D. Sarwate

Abstract: This paper provides fundamental limits on the sample complexity of estimating dictionaries for tensor data. The specific focus of this work is on $K$th-order tensor data and the case where the underlying dictionary can be expressed in terms of $K$ smaller dictionaries. It is assumed the data are generated by linear combinations of these structured dictionary atoms and observed through white Gaussian noise. This work first provides a general lower bound on the minimax risk of dictionary learning for such tensor data and then adapts the proof techniques for specialized results in the case of sparse and sparse-Gaussian linear combinations. The results suggest the sample complexity of dictionary learning for tensor data can be significantly lower than that for unstructured data: for unstructured data it scales linearly with the product of the dictionary dimensions, whereas for tensor-structured data the bound scales linearly with the sum of the product of the dimensions of the (smaller) component dictionaries. A partial converse is provided for the case of 2nd-order tensor data to show that the bounds in this paper can be tight. This involves developing an algorithm for learning highly-structured dictionaries from noisy tensor data. Finally, numerical experiments highlight the advantages associated with explicitly accounting for tensor data structure during dictionary learning.

Submitted 18 February, 2018; v1 submitted 9 August, 2016; originally announced August 2016.

Comments: In IEEE Transactions on Information Theory

Journal ref: IEEE Trans. Inform. Theory, vol. 64, no. 4, pp. 2706-2726, Apr. 2018

arXiv:1605.05284 [pdf, other]

doi 10.1109/ISIT.2016.7541479

Minimax Lower Bounds for Kronecker-Structured Dictionary Learning

Authors: Zahra Shakeri, Waheed U. Bajwa, Anand D. Sarwate

Abstract: Dictionary learning is the problem of estimating the collection of atomic elements that provide a sparse representation of measured/collected signals or data. This paper finds fundamental limits on the sample complexity of estimating dictionaries for tensor data by proving a lower bound on the minimax risk. This lower bound depends on the dimensions of the tensor and parameters of the generative model. The focus of this paper is on second-order tensor data, with the underlying dictionaries constructed by taking the Kronecker product of two smaller dictionaries and the observed data generated by sparse linear combinations of dictionary atoms observed through white Gaussian noise. In this regard, the paper provides a general lower bound on the minimax risk and also adapts the proof techniques for equivalent results using sparse and Gaussian coefficient models. The reported results suggest that the sample complexity of dictionary learning for tensor data can be significantly lower than that for unstructured data.

Submitted 17 May, 2016; originally announced May 2016.

Comments: 5 pages, 1 figure. To appear in 2016 IEEE International Symposium on Information Theory

Journal ref: Proc. IEEE Intl. Symp. Information Theory, Barcelona, Spain, Jul. 10-15, 2016, pp. 1148-1152

arXiv:1602.03571 [pdf, other]

High Dimensional Inference with Random Maximum A-Posteriori Perturbations

Authors: Tamir Hazan, Francesco Orabona, Anand D. Sarwate, Subhransu Maji, Tommi Jaakkola

Abstract: This paper presents a new approach, called perturb-max, for high-dimensional statistical inference that is based on applying random perturbations followed by optimization. This framework injects randomness to maximum a-posteriori (MAP) predictors by randomly perturbing the potential function for the input. A classic result from extreme value statistics asserts that perturb-max operations generate unbiased samples from the Gibbs distribution using high-dimensional perturbations. Unfortunately, the computational cost of generating so many high-dimensional random variables can be prohibitive. However, when the perturbations are of low dimension, sampling the perturb-max prediction is as efficient as MAP optimization. This paper shows that the expected value of perturb-max inference with low dimensional perturbations can be used sequentially to generate unbiased samples from the Gibbs distribution. Furthermore the expected value of the maximal perturbations is a natural bound on the entropy of such perturb-max models. A measure concentration result for perturb-max values shows that the deviation of their sampled average from its expectation decays exponentially in the number of samples, allowing effective approximation of the expectation.

Submitted 30 May, 2017; v1 submitted 10 February, 2016; originally announced February 2016.

Comments: 47 pages, 10 figures, under review

arXiv:1602.02384 [pdf, other]

The benefit of a 1-bit jump-start, and the necessity of stochastic encoding, in jamming channels

Authors: Bikash Kumar Dey, Sidharth Jaggi, Michael Langberg, Anand D. Sarwate

Abstract: We consider the problem of communicating a message $m$ in the presence of a malicious jamming adversary (Calvin), who can erase an arbitrary set of up to $pn$ bits, out of $n$ transmitted bits $(x_1,\ldots,x_n)$. The capacity of such a channel when Calvin is exactly causal, i.e. Calvin's decision of whether or not to erase bit $x_i$ depends on his observations $(x_1,\ldots,x_i)$ was recently characterized to be $1-2p$. In this work we show two (perhaps) surprising phenomena. Firstly, we demonstrate via a novel code construction that if Calvin is delayed by even a single bit, i.e. Calvin's decision of whether or not to erase bit $x_i$ depends only on $(x_1,\ldots,x_{i-1})$ (and is independent of the "current bit" $x_i$) then the capacity increases to $1-p$ when the encoder is allowed to be stochastic. Secondly, we show via a novel jamming strategy for Calvin that, in the single-bit-delay setting, if the encoding is deterministic (i.e. the transmitted codeword is a deterministic function of the message $m$) then no rate asymptotically larger than $1-2p$ is possible with vanishing probability of error, hence stochastic encoding (using private randomness at the encoder) is essential to achieve the capacity of $1-p$ against a one-bit-delayed Calvin.

Submitted 7 February, 2016; originally announced February 2016.

Comments: 21 pages, 4 figures, extended draft of submission to ISIT 2016

arXiv:1508.01818 [pdf, other]

Designing Incentive Schemes For Privacy-Sensitive Users

Authors: Chong Huang, Lalitha Sankar, Anand D. Sarwate

Abstract: Businesses (retailers) often wish to offer personalized advertisements (coupons) to individuals (consumers), but run the risk of strong reactions from consumers who want a customized shopping experience but feel their privacy has been violated. Existing models for privacy such as differential privacy or information theory try to quantify privacy risk but do not capture the subjective experience and heterogeneous expression of privacy-sensitivity. We propose a Markov decision process (MDP) model to capture (i) different consumer privacy sensitivities via a time-varying state; (ii) different coupon types (action set) for the retailer; and (iii) the action-and-state-dependent cost for perceived privacy violations. For the simple case with two states ("Normal" and "Alerted"), two coupons (targeted and untargeted) model, and consumer behavior statistics known to the retailer, we show that a stationary threshold-based policy is the optimal coupon-offering strategy for a retailer that wishes to minimize its expected discounted cost. The threshold is a function of all model parameters; the retailer offers a targeted coupon if their belief that the consumer is in the "Alerted" state is below the threshold. We extend this two-state model to consumers with multiple privacy-sensitivity states as well as coupon-dependent state transition probabilities. Furthermore, we study the case with imperfect (noisy) cost feedback from consumers and uncertain initial belief state.

Submitted 23 September, 2015; v1 submitted 7 August, 2015; originally announced August 2015.

Comments: 25 pages, 10 figures, submitted to journal of privacy and confidentiality

arXiv:1412.5617 [pdf, other]

Learning from Data with Heterogeneous Noise using SGD

Authors: Shuang Song, Kamalika Chaudhuri, Anand D. Sarwate

Abstract: We consider learning from data of variable quality that may be obtained from different heterogeneous sources. Addressing learning from heterogeneous data in its full generality is a challenging problem. In this paper, we adopt instead a model in which data is observed through heterogeneous noise, where the noise level reflects the quality of the data source. We study how to use stochastic gradient algorithms to learn in this model. Our study is motivated by two concrete examples where this problem arises naturally: learning with local differential privacy based on data from multiple sources with different privacy requirements, and learning from data with labels of variable quality. The main contribution of this paper is to identify how heterogeneous noise impacts performance. We show that given two datasets with heterogeneous noise, the order in which to use them in standard SGD depends on the learning rate. We propose a method for changing the learning rate as a function of the heterogeneity, and prove new regret bounds for our method in two cases of interest. Experiments on real data show that our method performs better than using a single learning rate and using only the less noisy of the two datasets when the noise level is low to moderate.

Submitted 17 December, 2014; originally announced December 2014.

arXiv:1409.7614 [pdf, other]

Generalized Opinion Dynamics from Local Optimization Rules

Authors: Avhishek Chatterjee, Anand D. Sarwate, Sriram Vishwanath

Abstract: We study generalizations of the Hegselmann-Krause (HK) model for opinion dynamics, incorporating features and parameters that are natural components of observed social systems. The first generalization is one where the strength of influence depends on the distance of the agents' opinions. Under this setup, we identify conditions under which the opinions converge in finite time, and provide a qualitative characterization of the equilibrium. We interpret the HK model opinion update rule as a quadratic cost-minimization rule. This enables a second generalization: a family of update rules which possess different equilibrium properties. Subsequently, we investigate models in which an external force can behave strategically to modulate/influence user updates. We consider cases where this external force can introduce additional agents and cases where it can modify the cost structures for other agents. We describe and analyze some strategies through which such modulation may be possible in an order-optimal manner. Our simulations demonstrate that generalized dynamics differ qualitatively and quantitatively from traditional HK dynamics.

Submitted 25 September, 2014; originally announced September 2014.

Comments: 20 pages, under review

arXiv:1407.5383 [pdf, other]

doi 10.3390/e16105339

Redundancy of Exchangeable Estimators

Authors: Narayana P. Santhanam, Anand D. Sarwate, Jae Oh Woo

Abstract: Exchangeable random partition processes are the basis for Bayesian approaches to statistical inference in large alphabet settings. On the other hand, the notion of the pattern of a sequence provides an information-theoretic framework for data compression in large alphabet scenarios. Because data compression and parameter estimation are intimately related, we study the redundancy of Bayes estimators coming from Poisson-Dirichlet priors (or "Chinese restaurant processes") and the Pitman-Yor prior. This provides an understanding of these estimators in the setting of unknown discrete alphabets from the perspective of universal compression. In particular, we identify relations between alphabet sizes and sample sizes where the redundancy is small, thereby characterizing useful regimes for these estimators.

Submitted 20 October, 2014; v1 submitted 21 July, 2014; originally announced July 2014.

Comments: 18 pages

arXiv:1310.4227 [pdf, other]

On Measure Concentration of Random Maximum A-Posteriori Perturbations

Authors: Francesco Orabona, Tamir Hazan, Anand D. Sarwate, Tommi Jaakkola

Abstract: The maximum a-posteriori (MAP) perturbation framework has emerged as a useful approach for inference and learning in high dimensional complex models. By maximizing a randomly perturbed potential function, MAP perturbations generate unbiased samples from the Gibbs distribution. Unfortunately, the computational cost of generating so many high-dimensional random variables can be prohibitive. More efficient algorithms use sequential sampling strategies based on the expected value of low dimensional MAP perturbations. This paper develops new measure concentration inequalities that bound the number of samples needed to estimate such expected values. Applying the general result to MAP perturbations can yield a more efficient algorithm to approximate sampling from the Gibbs distribution. The measure concentration result is of general interest and may be applicable to other areas involving expected estimations.

Submitted 15 October, 2013; originally announced October 2013.

arXiv:1306.2347 [pdf, other]

Auditing: Active Learning with Outcome-Dependent Query Costs

Authors: Sivan Sabato, Anand D. Sarwate, Nathan Srebro

Abstract: We propose a learning setting in which unlabeled data is free, and the cost of a label depends on its value, which is not known in advance. We study binary classification in an extreme case, where the algorithm only pays for negative labels. Our motivation comes from applications such as fraud detection, in which investigating an honest transaction should be avoided if possible. We term the setting auditing, and consider the auditing complexity of an algorithm: the number of negative labels the algorithm requires in order to learn a hypothesis with low relative error. We design auditing algorithms for simple hypothesis classes (thresholds and rectangles), and show that with these algorithms, the auditing complexity can be significantly lower than the active label complexity. We also discuss a general competitive approach for auditing and possible modifications to the framework.

Submitted 12 July, 2015; v1 submitted 10 June, 2013; originally announced June 2013.

Comments: Corrections in section 5

Journal ref: Neural Information Processing Systems 26 (NIPS), 512-520, 2013

arXiv:1305.4548 [pdf, other]

Distributed Learning of Distributions via Social Sampling

Authors: Anand D. Sarwate, Tara Javidi

Abstract: A protocol for distributed estimation of discrete distributions is proposed. Each agent begins with a single sample from the distribution, and the goal is to learn the empirical distribution of the samples. The protocol is based on a simple message-passing model motivated by communication in social networks. Agents sample a message randomly from their current estimates of the distribution, resulting in a protocol with quantized messages. Using tools from stochastic approximation, the algorithm is shown to converge almost surely. Examples illustrate three regimes with different consensus phenomena. Simulations demonstrate this convergence and give some insight into the effect of network topology.

Submitted 5 June, 2014; v1 submitted 20 May, 2013; originally announced May 2013.

Comments: 17 pages, accepted to IEEE Transactions on Automatic Control

arXiv:1209.2755 [pdf, ps, other]

Relaxing the Gaussian AVC

Authors: Anand D. Sarwate, Michael Gastpar

Abstract: The arbitrarily varying channel (AVC) is a conservative way of modeling an unknown interference, and the corresponding capacity results are pessimistic. We reconsider the Gaussian AVC by relaxing the classical model and thereby weakening the adversarial nature of the interference. We examine three different relaxations. First, we show how a very small amount of common randomness between transmitter and receiver is sufficient to achieve the rates of fully randomized codes. Second, akin to the dirty paper coding problem, we study the impact of an additional interference known to the transmitter. We provide partial capacity results that differ significantly from the standard AVC. Third, we revisit a Gaussian MIMO AVC in which the interference is arbitrary but of limited dimension.

Submitted 12 September, 2012; originally announced September 2012.

Comments: Submitted to the IEEE Transactions on Information Theory

arXiv:1207.2812 [pdf, other]

Near-Optimal Algorithms for Differentially-Private Principal Components

Authors: Kamalika Chaudhuri, Anand D. Sarwate, Kaushik Sinha

Abstract: Principal components analysis (PCA) is a standard tool for identifying good low-dimensional approximations to data in high dimension. Many data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We show that the sample complexity of the proposed method differs from the existing procedure in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. We furthermore illustrate our results, showing that on real data there is a large performance gap between the existing method and our method.

Submitted 7 August, 2013; v1 submitted 11 July, 2012; originally announced July 2012.

Comments: 37 pages, 8 figures; final version to appear in the Journal of Machine Learning Research, preliminary version was at NIPS 2012

arXiv:1204.2587 [pdf, ps, other]

doi 10.1109/TIT.2013.2245721

Upper Bounds on the Capacity of Binary Channels with Causal Adversaries

Authors: Bikash Kumar Dey, Sidharth Jaggi, Michael Langberg, Anand D. Sarwate

Abstract: In this work we consider the communication of information in the presence of a causal adversarial jammer. In the setting under study, a sender wishes to communicate a message to a receiver by transmitting a codeword $(x_1,...,x_n)$ bit-by-bit over a communication channel. The sender and the receiver do not share common randomness. The adversarial jammer can view the transmitted bits $x_i$ one at a time, and can change up to a $p$-fraction of them. However, the decisions of the jammer must be made in a causal manner. Namely, for each bit $x_i$ the jammer's decision on whether to corrupt it or not must depend only on $x_j$ for $j \leq i$. This is in contrast to the "classical" adversarial jamming situations in which the jammer has no knowledge of $(x_1,...,x_n)$, or knows $(x_1,...,x_n)$ completely. In this work, we present upper bounds (that hold under both the average and maximal probability of error criteria) on the capacity which hold for both deterministic and stochastic encoding schemes.

Submitted 13 December, 2012; v1 submitted 11 April, 2012; originally announced April 2012.

Comments: To appear in the IEEE Transactions on Information Theory; shortened version appeared at ISIT 2012

arXiv:0912.0071 [pdf, ps, other]

Differentially Private Empirical Risk Minimization

Authors: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate

Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the $ε$-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.

Submitted 16 February, 2011; v1 submitted 30 November, 2009; originally announced December 2009.

Comments: 40 pages, 7 figures, accepted to the Journal of Machine Learning Research

arXiv:0907.1413

Privacy constraints in regularized convex optimization

Authors: Kamalika Chaudhuri, Anand D. Sarwate

Abstract: This paper is withdrawn due to some errors, which are corrected in arXiv:0912.0071v4 [cs.LG].

Submitted 21 June, 2011; v1 submitted 9 July, 2009; originally announced July 2009.

Comments: This paper has been withdrawn by the authors due to some errors. Corrections have been included in arXiv:0912.0071v4

arXiv:0810.2513 [pdf, ps, other]

The Impact of Mobility on Gossip Algorithms

Authors: Anand D. Sarwate, Alexandros G. Dimakis

Abstract: The influence of node mobility on the convergence time of averaging gossip algorithms in networks is studied. It is shown that a small number of fully mobile nodes can yield a significant decrease in convergence time. A method is developed for deriving lower bounds on the convergence time by merging nodes according to their mobility pattern. This method is used to show that if the agents have one-dimensional mobility in the same direction the convergence time is improved by at most a constant. Upper bounds are obtained on the convergence time using techniques from the theory of Markov chains and show that simple models of mobility can dramatically accelerate gossip as long as the mobility paths significantly overlap. Simulations verify that different mobility patterns can have significantly different effects on the convergence of distributed algorithms.

Submitted 21 June, 2011; v1 submitted 14 October, 2008; originally announced October 2008.

Comments: Revised version submitted to IEEE Transactions on Information Theory

arXiv:0711.3926 [pdf, ps, other]

Rateless codes for AVC models

Authors: Anand D. Sarwate, Michael Gastpar

Abstract: The arbitrarily varying channel (AVC) is a channel model whose state is selected maliciously by an adversary. Fixed-blocklength coding assumes a worst-case bound on the adversary's capabilities, which leads to pessimistic results. This paper defines a variable-length perspective on this problem, for which achievable rates are shown that depend on the realized actions of the adversary. Specifically, rateless codes are constructed which require a limited amount of common randomness. These codes are constructed for two kinds of AVC models. In the first the channel state cannot depend on the channel input, and in the second it can. As a byproduct, the randomized coding capacity of the AVC with state depending on the transmitted codeword is found and shown to be achievable with a small amount of common randomness. The results for this model are proved using a randomized strategy based on list decoding.

Submitted 5 October, 2009; v1 submitted 25 November, 2007; originally announced November 2007.

Comments: 14 pages, double column, extended version of paper to appear in the IEEE Transactions on Information Theory

arXiv:0711.0237 [pdf, ps, other]

doi 10.1109/TIT.2009.2034779

Zero-rate feedback can achieve the empirical capacity

Authors: Krishnan Eswaran, Anand D. Sarwate, Anant Sahai, Michael Gastpar

Abstract: The utility of limited feedback for coding over an individual sequence of DMCs is investigated. This study complements recent results showing how limited or noisy feedback can boost the reliability of communication. A strategy with fixed input distribution $P$ is given that asymptotically achieves rates arbitrarily close to the mutual information induced by $P$ and the state-averaged channel. When the capacity achieving input distribution is the same over all channel states, this achieves rates at least as large as the capacity of the state averaged channel, sometimes called the empirical capacity.

Submitted 10 August, 2009; v1 submitted 1 November, 2007; originally announced November 2007.

Comments: Revised version of paper originally submitted to IEEE Transactions on Information Theory, Nov. 2007. This version contains further revisions and clarifications

arXiv:0709.3921 [pdf, ps, other]

doi 10.1109/TSP.2007.908946

Geographic Gossip: Efficient Averaging for Sensor Networks

Authors: Alexandros G. Dimakis, Anand D. Sarwate, Martin J. Wainwright

Abstract: Gossip algorithms for distributed computation are attractive due to their simplicity, distributed nature, and robustness in noisy and uncertain environments. However, using standard gossip algorithms can lead to a significant waste in energy by repeatedly recirculating redundant information. For realistic sensor network model topologies like grids and random geometric graphs, the inefficiency of gossip schemes is related to the slow mixing times of random walks on the communication graph. We propose and analyze an alternative gossiping scheme that exploits geographic information. By utilizing geographic routing combined with a simple resampling method, we demonstrate substantial gains over previously proposed gossip protocols. For regular graphs such as the ring or grid, our algorithm improves standard gossip by factors of $n$ and $\sqrt{n}$ respectively. For the more challenging case of random geometric graphs, our algorithm computes the true average to accuracy $ε$ using $O(\frac{n^{1.5}}{\sqrt{\log n}} \log ε^{-1})$ radio transmissions, which yields a $\sqrt{\frac{n}{\log n}}$ factor improvement over standard gossip algorithms. We illustrate these theoretical results with experimental comparisons between our algorithm and standard methods as applied to various classes of random fields.

Submitted 25 September, 2007; originally announced September 2007.

Comments: To appear, IEEE Transactions on Signal Processing

arXiv:cs/0701146 [pdf, ps, other]

State constraints and list decoding for the AVC

Authors: Anand D. Sarwate, Michael Gastpar

Abstract: List decoding for arbitrarily varying channels (AVCs) under state constraints is investigated. It is shown that rates within $ε$ of the randomized coding capacity of AVCs with input-dependent state can be achieved under maximal error with list decoding using lists of size $O(1/ε)$. Under average error an achievable rate region and converse bound are given for lists of size $L$. These bounds are based on two different notions of symmetrizability and do not coincide in general. An example is given that shows that for list size $L$ the capacity may be positive but strictly smaller than the randomized coding capacity. This behavior is different than the situation without state constraints.

Submitted 5 October, 2009; v1 submitted 23 January, 2007; originally announced January 2007.

Comments: 22 pages, significantly changed version submitted to IEEE Transactions on Information Theory

arXiv:cs/0602071 [pdf, ps, other]

Geographic Gossip: Efficient Aggregation for Sensor Networks

Authors: Alexandros G. Dimakis, Anand D. Sarwate, Martin J. Wainwright

Abstract: Gossip algorithms for aggregation have recently received significant attention for sensor network applications because of their simplicity and robustness in noisy and uncertain environments. However, gossip algorithms can waste significant energy by essentially passing around redundant information multiple times. For realistic sensor network model topologies like grids and random geometric graphs, the inefficiency of gossip schemes is caused by slow mixing times of random walks on those graphs. We propose and analyze an alternative gossiping scheme that exploits geographic information. By utilizing a simple resampling method, we can demonstrate substantial gains over previously proposed gossip protocols. In particular, for random geometric graphs, our algorithm computes the true average to accuracy $1/n^a$ using $O(n^{1.5}\sqrt{\log n})$ radio transmissions, which reduces the energy consumption by a $\sqrt{\frac{n}{\log n}}$ factor over standard gossip algorithms.

Submitted 19 February, 2006; originally announced February 2006.

Comments: 8 pages total; to appear in Information Processing in Sensor Networks (IPSN) 2006

Showing 1–45 of 45 results for author: Sarwate, A D