Search | arXiv e-print repository

A simple and provable algorithm for sparse diagonal CCA

Authors: Megasthenis Asteris, Anastasios Kyrillidis, Oluwasanmi Koyejo, Russell Poldrack

Abstract: Given two sets of variables, derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced canonical variables are maximally correlated. Sparse CCA is NP-hard. We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that vari… ▽ More Given two sets of variables, derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced canonical variables are maximally correlated. Sparse CCA is NP-hard. We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that variables within each set are standardized and uncorrelated. Our algorithm operates on a low rank approximation of the input data and its computational complexity scales linearly with the number of input variables. It is simple to implement, and parallelizable. In contrast to most existing approaches, our algorithm administers precise control on the sparsity of the extracted canonical vectors, and comes with theoretical data-dependent global approximation guarantees, that hinge on the spectrum of the input data. Finally, it can be straightforwardly adapted to other constrained variants of CCA enforcing structure beyond sparsity. We empirically evaluate the proposed scheme and apply it on a real neuroimaging dataset to investigate associations between brain activity and behavior measurements. △ Less

Submitted 28 May, 2016; originally announced May 2016.

Comments: To appear at ICML 2016, 14 pages, 4 figures

arXiv:1603.06861 [pdf, other]

Trading-off variance and complexity in stochastic gradient descent

Authors: Vatsal Shah, Megasthenis Asteris, Anastasios Kyrillidis, Sujay Sanghavi

Abstract: Stochastic gradient descent is the method of choice for large-scale machine learning problems, by virtue of its light complexity per iteration. However, it lags behind its non-stochastic counterparts with respect to the convergence rate, due to high variance introduced by the stochastic updates. The popular Stochastic Variance-Reduced Gradient (SVRG) method mitigates this shortcoming, introducing… ▽ More Stochastic gradient descent is the method of choice for large-scale machine learning problems, by virtue of its light complexity per iteration. However, it lags behind its non-stochastic counterparts with respect to the convergence rate, due to high variance introduced by the stochastic updates. The popular Stochastic Variance-Reduced Gradient (SVRG) method mitigates this shortcoming, introducing a new update rule which requires infrequent passes over the entire input dataset to compute the full-gradient. In this work, we propose CheapSVRG, a stochastic variance-reduction optimization scheme. Our algorithm is similar to SVRG but instead of the full gradient, it uses a surrogate which can be efficiently computed on a small subset of the input data. It achieves a linear convergence rate ---up to some error level, depending on the nature of the optimization problem---and features a trade-off between the computational complexity and the convergence rate. Empirical evaluation shows that CheapSVRG performs at least competitively compared to the state of the art. △ Less

Submitted 22 March, 2016; originally announced March 2016.

Comments: 14 pages, 13 figures, first edition on 9th of October 2015

arXiv:1603.02782 [pdf, other]

Bipartite Correlation Clustering -- Maximizing Agreements

Authors: Megasthenis Asteris, Anastasios Kyrillidis, Dimitris Papailiopoulos, Alexandros G. Dimakis

Abstract: In Bipartite Correlation Clustering (BCC) we are given a complete bipartite graph $G$ with `+' and `-' edges, and we seek a vertex clustering that maximizes the number of agreements: the number of all `+' edges within clusters plus all `-' edges cut across clusters. BCC is known to be NP-hard. We present a novel approximation algorithm for $k$-BCC, a variant of BCC with an upper bound $k$ on the… ▽ More In Bipartite Correlation Clustering (BCC) we are given a complete bipartite graph $G$ with `+' and `-' edges, and we seek a vertex clustering that maximizes the number of agreements: the number of all `+' edges within clusters plus all `-' edges cut across clusters. BCC is known to be NP-hard. We present a novel approximation algorithm for $k$-BCC, a variant of BCC with an upper bound $k$ on the number of clusters. Our algorithm outputs a $k$-clustering that provably achieves a number of agreements within a multiplicative ${(1-δ)}$-factor from the optimal, for any desired accuracy $δ$. It relies on solving a combinatorially constrained bilinear maximization on the bi-adjacency matrix of $G$. It runs in time exponential in $k$ and $δ^{-1}$, but linear in the size of the input. Further, we show that, in the (unconstrained) BCC setting, an ${(1-δ)}$-approximation can be achieved by $O(δ^{-1})$ clusters regardless of the size of the graph. In turn, our $k$-BCC algorithm implies an Efficient PTAS for the BCC objective of maximizing agreements. △ Less

Submitted 9 March, 2016; originally announced March 2016.

Comments: To appear in AISTATS 2016

arXiv:1508.00625 [pdf, ps, other]

Sparse PCA via Bipartite Matchings

Authors: Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, Alexandros G. Dimakis

Abstract: We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance. These components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but as we show this greedy procedure is suboptimal… ▽ More We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance. These components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but as we show this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. The extracted features capture variance that lies within a multiplicative factor arbitrarily close to 1 from the optimal. Our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem. Its complexity grows as a low order polynomial in the ambient dimension of the input data matrix, but exponentially in its rank. However, it can be effectively applied on a low-dimensional sketch of the data; this allows us to obtain polynomial-time approximation guarantees via spectral bounds. We evaluate our algorithm on real data-sets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches. △ Less

Submitted 3 August, 2015; originally announced August 2015.

arXiv:1506.02344 [pdf, other]

Stay on path: PCA along graph paths

Authors: Megasthenis Asteris, Anastasios Kyrillidis, Alexandros G. Dimakis, Han-Gyol Yi and, Bharath Chandrasekaran

Abstract: We introduce a variant of (sparse) PCA in which the set of feasible support sets is determined by a graph. In particular, we consider the following setting: given a directed acyclic graph $G$ on $p$ vertices corresponding to variables, the non-zero entries of the extracted principal component must coincide with vertices lying along a path in $G$. From a statistical perspective, information on th… ▽ More We introduce a variant of (sparse) PCA in which the set of feasible support sets is determined by a graph. In particular, we consider the following setting: given a directed acyclic graph $G$ on $p$ vertices corresponding to variables, the non-zero entries of the extracted principal component must coincide with vertices lying along a path in $G$. From a statistical perspective, information on the underlying network may potentially reduce the number of observations required to recover the population principal component. We consider the canonical estimator which optimally exploits the prior knowledge by solving a non-convex quadratic maximization on the empirical covariance. We introduce a simple network and analyze the estimator under the spiked covariance model. We show that side information potentially improves the statistical complexity. We propose two algorithms to approximate the solution of the constrained quadratic maximization, and recover a component with the desired properties. We empirically evaluate our schemes on synthetic and real datasets. △ Less

Submitted 18 June, 2015; v1 submitted 7 June, 2015; originally announced June 2015.

Comments: 12 pages, 5 figures, In Proceedings of International Conference on Machine Learning (ICML) 2015

arXiv:1504.05294 [pdf, ps, other]

On Approximating the Sum-Rate for Multiple-Unicasts

Authors: Karthikeyan Shanmugam, Megasthenis Asteris, Alexandros G. Dimakis

Abstract: We study upper bounds on the sum-rate of multiple-unicasts. We approximate the Generalized Network Sharing Bound (GNS cut) of the multiple-unicasts network coding problem with $k$ independent sources. Our approximation algorithm runs in polynomial time and yields an upper bound on the joint source entropy rate, which is within an $O(\log^2 k)$ factor from the GNS cut. It further yields a vector-li… ▽ More We study upper bounds on the sum-rate of multiple-unicasts. We approximate the Generalized Network Sharing Bound (GNS cut) of the multiple-unicasts network coding problem with $k$ independent sources. Our approximation algorithm runs in polynomial time and yields an upper bound on the joint source entropy rate, which is within an $O(\log^2 k)$ factor from the GNS cut. It further yields a vector-linear network code that achieves joint source entropy rate within an $O(\log^2 k)$ factor from the GNS cut, but \emph{not} with independent sources: the code induces a correlation pattern among the sources. Our second contribution is establishing a separation result for vector-linear network codes: for any given field $\mathbb{F}$ there exist networks for which the optimum sum-rate supported by vector-linear codes over $\mathbb{F}$ for independent sources can be multiplicatively separated by a factor of $k^{1-δ}$, for any constant ${δ>0}$, from the optimum joint entropy rate supported by a code that allows correlation between sources. Finally, we establish a similar separation result for the asymmetric optimum vector-linear sum-rates achieved over two distinct fields $\mathbb{F}_{p}$ and $\mathbb{F}_{q}$ for independent sources, revealing that the choice of field can heavily impact the performance of a linear network code. △ Less

Submitted 15 November, 2015; v1 submitted 20 April, 2015; originally announced April 2015.

Comments: 10 pages; Shorter version appeared at ISIT (International Symposium on Information Theory) 2015; some typos corrected

arXiv:1401.0734 [pdf, other]

Repairable Fountain Codes

Authors: Megasthenis Asteris, Alexandros G. Dimakis

Abstract: We introduce a new family of Fountain codes that are systematic and also have sparse parities. Given an input of $k$ symbols, our codes produce an unbounded number of output symbols, generating each parity independently by linearly combining a logarithmic number of randomly selected input symbols. The construction guarantees that for any $ε>0$ accessing a random subset of $(1+ε)k$ encoded symbols,… ▽ More We introduce a new family of Fountain codes that are systematic and also have sparse parities. Given an input of $k$ symbols, our codes produce an unbounded number of output symbols, generating each parity independently by linearly combining a logarithmic number of randomly selected input symbols. The construction guarantees that for any $ε>0$ accessing a random subset of $(1+ε)k$ encoded symbols, asymptotically suffices to recover the $k$ input symbols with high probability. Our codes have the additional benefit of logarithmic locality: a single lost symbol can be repaired by accessing a subset of $O(\log k)$ of the remaining encoded symbols. This is a desired property for distributed storage systems where symbols are spread over a network of storage nodes. Beyond recovery upon loss, local reconstruction provides an efficient alternative for reading symbols that cannot be accessed directly. In our code, a logarithmic number of disjoint local groups is associated with each systematic symbol, allowing multiple parallel reads. Our main mathematical contribution involves analyzing the rank of sparse random matrices with specific structure over finite fields. We rely on establishing that a new family of sparse random bipartite graphs have perfect matchings with high probability. △ Less

Submitted 3 January, 2014; originally announced January 2014.

Comments: To appear in IEEE Journal on Selected Areas in Communications, Issue on Communication Methodologies for Next-Generation Storage Systems 2013, 11 pages, 2 figures

arXiv:1312.5891 [pdf, other]

The Sparse Principal Component of a Constant-rank Matrix

Authors: Megasthenis Asteris, Dimitris S. Papailiopoulos, George N. Karystinos

Abstract: The computation of the sparse principal component of a matrix is equivalent to the identification of its principal submatrix with the largest maximum eigenvalue. Finding this optimal submatrix is what renders the problem ${\mathcal{NP}}$-hard. In this work, we prove that, if the matrix is positive semidefinite and its rank is constant, then its sparse principal component is polynomially computable… ▽ More The computation of the sparse principal component of a matrix is equivalent to the identification of its principal submatrix with the largest maximum eigenvalue. Finding this optimal submatrix is what renders the problem ${\mathcal{NP}}$-hard. In this work, we prove that, if the matrix is positive semidefinite and its rank is constant, then its sparse principal component is polynomially computable. Our proof utilizes the auxiliary unit vector technique that has been recently developed to identify problems that are polynomially solvable. Moreover, we use this technique to design an algorithm which, for any sparsity value, computes the sparse principal component with complexity ${\mathcal O}\left(N^{D+1}\right)$, where $N$ and $D$ are the matrix size and rank, respectively. Our algorithm is fully parallelizable and memory efficient. △ Less

Submitted 20 December, 2013; originally announced December 2013.

arXiv:1301.3791 [pdf, other]

XORing Elephants: Novel Erasure Codes for Big Data

Authors: Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, Dhruba Borthakur

Abstract: Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability. This paper shows how… ▽ More Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability. This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance. We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2x on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, this provides higher reliability, which is orders of magnitude higher compared to replication. △ Less

Submitted 16 January, 2013; originally announced January 2013.

Comments: Technical report, paper to appear in Proceedings of VLDB, 2013

arXiv:1106.1651 [pdf, other]

Sparse Principal Component of a Rank-deficient Matrix

Authors: Megasthenis Asteris, Dimitris S. Papailiopoulos, George N. Karystinos

Abstract: We consider the problem of identifying the sparse principal component of a rank-deficient matrix. We introduce auxiliary spherical variables and prove that there exists a set of candidate index-sets (that is, sets of indices to the nonzero elements of the vector argument) whose size is polynomially bounded, in terms of rank, and contains the optimal index-set, i.e. the index-set of the nonzero ele… ▽ More We consider the problem of identifying the sparse principal component of a rank-deficient matrix. We introduce auxiliary spherical variables and prove that there exists a set of candidate index-sets (that is, sets of indices to the nonzero elements of the vector argument) whose size is polynomially bounded, in terms of rank, and contains the optimal index-set, i.e. the index-set of the nonzero elements of the optimal solution. Finally, we develop an algorithm that computes the optimal sparse principal component in polynomial time for any sparsity degree. △ Less

Submitted 8 June, 2011; originally announced June 2011.

Comments: 5 pages, 1 figure, to be presented at ISIT

Showing 1–10 of 10 results for author: Asteris, M