Search | arXiv e-print repository

Randomized Algorithms for Symmetric Nonnegative Matrix Factorization

Authors: Koby Hayashi, Sinan G. Aksoy, Grey Ballard, Haesun Park

Abstract: Symmetric Nonnegative Matrix Factorization (SymNMF) is a technique in data analysis and machine learning that approximates a symmetric matrix with a product of a nonnegative, low-rank matrix and its transpose. To design faster and more scalable algorithms for SymNMF we develop two randomized algorithms for its computation. The first algorithm uses randomized matrix sketching to compute an initial… ▽ More Symmetric Nonnegative Matrix Factorization (SymNMF) is a technique in data analysis and machine learning that approximates a symmetric matrix with a product of a nonnegative, low-rank matrix and its transpose. To design faster and more scalable algorithms for SymNMF we develop two randomized algorithms for its computation. The first algorithm uses randomized matrix sketching to compute an initial low-rank input matrix and proceeds to use this input to rapidly compute a SymNMF. The second algorithm uses randomized leverage score sampling to approximately solve constrained least squares problems. Many successful methods for SymNMF rely on (approximately) solving sequences of constrained least squares problems. We prove theoretically that leverage score sampling can approximately solve nonnegative least squares problems to a chosen accuracy with high probability. Finally we demonstrate that both methods work well in practice by applying them to graph clustering tasks on large real world data sets. These experiments show that our methods approximately maintain solution quality and achieve significant speed ups for both large dense and large sparse problems. △ Less

Submitted 12 February, 2024; originally announced February 2024.

MSC Class: 65F55; 65F20

arXiv:2307.16652 [pdf, other]

Sequential and Shared-Memory Parallel Algorithms for Partitioned Local Depths

Authors: Aditya Devarakonda, Grey Ballard

Abstract: In this work, we design, analyze, and optimize sequential and shared-memory parallel algorithms for partitioned local depths (PaLD). Given a set of data points and pairwise distances, PaLD is a method for identifying strength of pairwise relationships based on relative distances, enabling the identification of strong ties within dense and sparse communities even if their sizes and within-community… ▽ More In this work, we design, analyze, and optimize sequential and shared-memory parallel algorithms for partitioned local depths (PaLD). Given a set of data points and pairwise distances, PaLD is a method for identifying strength of pairwise relationships based on relative distances, enabling the identification of strong ties within dense and sparse communities even if their sizes and within-community absolute distances vary greatly. We design two algorithmic variants that perform community structure analysis through triplet comparisons of pairwise distances. We present theoretical analyses of computation and communication costs and prove that the sequential algorithms are communication optimal, up to constant factors. We introduce performance optimization strategies that yield sequential speedups of up to $29\times$ over a baseline sequential implementation and parallel speedups of up to $19.4\times$ over optimized sequential implementations using up to $32$ threads on an Intel multicore CPU. △ Less

Submitted 31 July, 2023; originally announced July 2023.

MSC Class: 68W10 ACM Class: D.1.3

arXiv:2207.10437 [pdf, other]

Communication Lower Bounds and Optimal Algorithms for Multiple Tensor-Times-Matrix Computation

Authors: Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse

Abstract: Multiple Tensor-Times-Matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required to perform the Multi-TTM computation in parallel. The crux of the proof relies on analytically solving a con… ▽ More Multiple Tensor-Times-Matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required to perform the Multi-TTM computation in parallel. The crux of the proof relies on analytically solving a constrained, nonlinear optimization problem. We also present a parallel algorithm to perform this computation that organizes the processors into a logical grid with twice as many modes as the input tensor. We show that with correct choices of grid dimensions, the communication cost of the algorithm attains the lower bounds and is therefore communication optimal. Finally, we show that our algorithm can significantly reduce communication compared to the straightforward approach of expressing the computation as a sequence of tensor-times-matrix operations. △ Less

Submitted 2 February, 2023; v1 submitted 21 July, 2022; originally announced July 2022.

arXiv:2205.13407 [pdf, ps, other]

Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds

Authors: Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse

Abstract: Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more effic… ▽ More Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more efficient algorithms that match the leading-order terms in the lower bounds exactly and improve practical performance. The main result of this work is the establishment of memory-independent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the aspect ratios of the matrices. △ Less

Submitted 26 May, 2022; originally announced May 2022.

arXiv:1909.06524 [pdf, ps, other]

A Generalized Randomized Rank-Revealing Factorization

Authors: Grey Ballard, James Demmel, Ioana Dumitriu, Alexander Rusciano

Abstract: We introduce a Generalized Randomized QR-decomposition that may be applied to arbitrary products of matrices and their inverses, without needing to explicitly compute the products or inverses. This factorization is a critical part of a communication-optimal spectral divide-and-conquer algorithm for the nonsymmetric eigenvalue problem. In this paper, we establish that this randomized QR-factorizati… ▽ More We introduce a Generalized Randomized QR-decomposition that may be applied to arbitrary products of matrices and their inverses, without needing to explicitly compute the products or inverses. This factorization is a critical part of a communication-optimal spectral divide-and-conquer algorithm for the nonsymmetric eigenvalue problem. In this paper, we establish that this randomized QR-factorization satisfies the strong rank-revealing properties. We also formally prove its stability, making it suitable in applications. Finally, we present numerical experiments which demonstrate that our theoretical bounds capture the empirical behavior of the factorization. △ Less

Submitted 13 September, 2019; originally announced September 2019.

MSC Class: 15-04

arXiv:1909.01149 [pdf, other]

PLANC: Parallel Low Rank Approximation with Non-negativity Constraints

Authors: Srinivas Eswar, Koby Hayashi, Grey Ballard, Ramakrishnan Kannan, Michael A. Matheson, Haesun Park

Abstract: We consider the problem of low-rank approximation of massive dense non-negative tensor data, for example to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution to handle massive data sets, loading the input da… ▽ More We consider the problem of low-rank approximation of massive dense non-negative tensor data, for example to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution to handle massive data sets, loading the input data across the memories of multiple nodes and performing efficient and scalable parallel algorithms to compute the low-rank approximation. We present a software package called PLANC (Parallel Low Rank Approximation with Non-negativity Constraints), which implements our solution and allows for extension in terms of data (dense or sparse, matrices or tensors of any order), algorithm (e.g., from multiplicative updating techniques to alternating direction method of multipliers), and architecture (we exploit GPUs to accelerate the computation in this work).We describe our parallel distributions and algorithms, which are careful to avoid unnecessary communication and computation, show how to extend the software to include new algorithms and/or constraints, and report efficiency and scalability results for both synthetic and real-world data sets. △ Less

Submitted 30 August, 2019; originally announced September 2019.

Comments: arXiv admin note: text overlap with arXiv:1806.07985

arXiv:1906.04749 [pdf, other]

doi 10.1364/AO.58.008598

Joint 3D Localization and Classification of Space Debris using a Multispectral Rotating Point Spread Function

Authors: Chao Wang, Grey Ballard, Robert Plemmons, Sudhakar Prasad

Abstract: We consider the problem of joint three-dimensional (3D) localization and material classification of unresolved space debris using a multispectral rotating point spread function (RPSF). The use of RPSF allows one to estimate the 3D locations of point sources from their rotated images acquired by a single 2D sensor array, since the amount of rotation of each source image about its x, y location depe… ▽ More We consider the problem of joint three-dimensional (3D) localization and material classification of unresolved space debris using a multispectral rotating point spread function (RPSF). The use of RPSF allows one to estimate the 3D locations of point sources from their rotated images acquired by a single 2D sensor array, since the amount of rotation of each source image about its x, y location depends on its axial distance z. Using multi-spectral images, with one RPSF per spectral band, we are able not only to localize the 3D positions of the space debris but also classify their material composition. We propose a three-stage method for achieving joint localization and classification. In Stage 1, we adopt an optimization scheme for localization in which the spectral signature of each material is assumed to be uniform, which significantly improves efficiency and yields better localization results than possible with a single spectral band. In Stage 2, we estimate the spectral signature and refine the localization result via an alternating approach. We process classification in the final stage. Both Poisson noise and Gaussian noise models are considered, and the implementation of each is discussed. Numerical tests using multispectral data from NASA show the efficiency of our three-stage approach and illustrate the improvement of point source localization and spectral classification from using multiple bands over a single band. △ Less

Submitted 11 June, 2019; originally announced June 2019.

Comments: 25 pages

arXiv:1901.06043 [pdf, other]

doi 10.1145/3378445

TuckerMPI: A Parallel C++/MPI Software Package for Large-scale Data Compression via the Tucker Tensor Decomposition

Authors: Grey Ballard, Alicia Klinvex, Tamara G. Kolda

Abstract: Our goal is compression of massive-scale grid-structured data, such as the multi-terabyte output of a high-fidelity computational simulation. For such data sets, we have developed a new software package called TuckerMPI, a parallel C++/MPI software package for compressing distributed data. The approach is based on treating the data as a tensor, i.e., a multidimensional array, and computing its tru… ▽ More Our goal is compression of massive-scale grid-structured data, such as the multi-terabyte output of a high-fidelity computational simulation. For such data sets, we have developed a new software package called TuckerMPI, a parallel C++/MPI software package for compressing distributed data. The approach is based on treating the data as a tensor, i.e., a multidimensional array, and computing its truncated Tucker decomposition, a higher-order analogue to the truncated singular value decomposition of a matrix. The result is a low-rank approximation of the original tensor-structured data. Compression efficiency is achieved by detecting latent global structure within the data, which we contrast to most compression methods that are focused on local structure. In this work, we describe TuckerMPI, our implementation of the truncated Tucker decomposition, including details of the data distribution and in-memory layouts, the parallel and serial implementations of the key kernels, and analysis of the storage, communication, and computational costs. We test the software on 4.5 terabyte and 6.7 terabyte data sets distributed across 100s of nodes (1000s of MPI processes), achieving compression rates between 100-200,000$\times$ which equates to 99-99.999% compression (depending on the desired accuracy) in substantially less time than it would take to even read the same dataset from a parallel filesystem. Moreover, we show that our method also allows for reconstruction of partial or down-sampled data on a single node, without a parallel computer so long as the reconstructed portion is small enough to fit on a single machine, e.g., in the instance of reconstructing/visualizing a single down-sampled time step or computing summary statistics. △ Less

Submitted 21 August, 2019; v1 submitted 17 January, 2019; originally announced January 2019.

Journal ref: ACM Transactions on Mathematical Software, Vol. 46, No. 2, Article 13, June 2020

arXiv:1806.07985 [pdf, other]

Parallel Nonnegative CP Decomposition of Dense Tensors

Authors: Grey Ballard, Koby Hayashi, Ramakrishnan Kannan

Abstract: The CP tensor decomposition is a low-rank approximation of a tensor. We present a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensor data that can enforce nonnegativity of the computed low-rank factors. The principal task is to parallelize the matricized-tensor times Khatri-Rao product (MTTKRP) bottleneck… ▽ More The CP tensor decomposition is a low-rank approximation of a tensor. We present a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensor data that can enforce nonnegativity of the computed low-rank factors. The principal task is to parallelize the matricized-tensor times Khatri-Rao product (MTTKRP) bottleneck subcomputation. The algorithm is computation efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmark our software on synthetic as well as hyperspectral image and neuroscience dynamic functional connectivity data, demonstrating that our algorithm scales well to 100s of nodes (up to 4096 cores) and is faster and more general than the currently available parallel software. △ Less

Submitted 19 June, 2018; originally announced June 2018.

arXiv:1805.05278 [pdf, ps, other]

A 3D Parallel Algorithm for QR Decomposition

Authors: Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Nicholas Knight

Abstract: Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different c… ▽ More Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs. △ Less

Submitted 14 May, 2018; originally announced May 2018.

arXiv:1801.00843 [pdf, ps, other]

The geometry of rank decompositions of matrix multiplication II: $3\times 3$ matrices

Authors: Grey Ballard, Christian Ikenmeyer, J. M. Landsberg, Nick Ryder

Abstract: This is the second in a series of papers on rank decompositions of the matrix multiplication tensor. We present new rank $23$ decompositions for the $3\times 3$ matrix multiplication tensor $M_{\langle 3\rangle}$. All our decompositions have symmetry groups that include the standard cyclic permutation of factors but otherwise exhibit a range of behavior. One of them has 11 cubes as summands and ad… ▽ More This is the second in a series of papers on rank decompositions of the matrix multiplication tensor. We present new rank $23$ decompositions for the $3\times 3$ matrix multiplication tensor $M_{\langle 3\rangle}$. All our decompositions have symmetry groups that include the standard cyclic permutation of factors but otherwise exhibit a range of behavior. One of them has 11 cubes as summands and admits an unexpected symmetry group of order 12. We establish basic information regarding symmetry groups of decompositions and outline two approaches for finding new rank decompositions of $M_{\langle n\rangle}$ for larger $n$. △ Less

Submitted 2 January, 2018; originally announced January 2018.

MSC Class: 68Q17; 14L30; 15A69

arXiv:1708.08976 [pdf, other]

Shared Memory Parallelization of MTTKRP for Dense Tensors

Authors: Koby Hayashi, Grey Ballard, Jeffrey Jiang, Michael Tobia

Abstract: The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in me… ▽ More The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We benchmark sequential and parallel performance of our implementations, demonstrating high sequential performance and efficient parallel scaling. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to $7.4\times$ over existing parallel software. △ Less

Submitted 29 August, 2017; originally announced August 2017.

Comments: 10 pages, 27 figures

arXiv:1708.07401 [pdf, ps, other]

Communication Lower Bounds for Matricized Tensor Times Khatri-Rao Product

Authors: Grey Ballard, Nicholas Knight, Kathryn Rouse

Abstract: The matricized-tensor times Khatri-Rao product computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel al… ▽ More The matricized-tensor times Khatri-Rao product computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation. △ Less

Submitted 22 October, 2017; v1 submitted 24 August, 2017; originally announced August 2017.

arXiv:1609.09154 [pdf, other]

MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization

Authors: Ramakrishnan Kannan, Grey Ballard, Haesun Park

Abstract: Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors $W$ and $H$, for the given input matrix $A$, such that $A \approx W H$. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data m… ▽ More Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors $W$ and $H$, for the given input matrix $A$, such that $A \approx W H$. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solves alternating non-negative least squares (NLS) subproblems for $W$ and $H$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans for few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online. △ Less

Submitted 28 September, 2016; originally announced September 2016.

Comments: arXiv admin note: text overlap with arXiv:1509.09313

arXiv:1604.03703 [pdf, other]

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

Authors: Edgar Solomonik, Grey Ballard, James Demmel, Torsten Hoefler

Abstract: Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional considera… ▽ More Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store $c$ copies of the symmetric matrix, our algorithm requires $Θ(\sqrt{c})$ less interprocessor communication than previously known algorithms, for any $c\leq p^{1/3}$ when using $p$ processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to $O(\log p)$ thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices. △ Less

Submitted 13 April, 2016; originally announced April 2016.

arXiv:1603.05627 [pdf, ps, other]

Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication

Authors: Grey Ballard, Alex Druinsky, Nicholas Knight, Oded Schwartz

Abstract: We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in… ▽ More We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is sparsity dependent, meaning that we seek the best algorithm for the given input matrices. In addition to our (3D) fine-grained model, we also propose coarse-grained 1D and 2D models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms. Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space. △ Less

Submitted 17 March, 2016; originally announced March 2016.

arXiv:1510.06689 [pdf, ps, other]

doi 10.1109/IPDPS.2016.67

Parallel Tensor Compression for Large-Scale Scientific Data

Authors: Woody Austin, Grey Ballard, Tamara G. Kolda

Abstract: As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8~TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can comput… ▽ More As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8~TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments. △ Less

Submitted 23 February, 2016; v1 submitted 22 October, 2015; originally announced October 2015.

Journal ref: IPDPS'16: Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium, pp. 912-922, May 2016

arXiv:1510.00844 [pdf, other]

doi 10.1137/15M104253X

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

Authors: Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams

Abstract: Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, th… ▽ More Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrencies. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research. △ Less

Submitted 16 November, 2016; v1 submitted 3 October, 2015; originally announced October 2015.

Journal ref: SIAM Journal of Scientific Computing, Volume 38, Number 6, pp. C624-C651, 2016

arXiv:1509.09313 [pdf, other]

A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization

Authors: Ramakrishnan Kannan, Grey Ballard, Haesun Park

Abstract: Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors $W$ and $H$, for the given input matrix $A$, such that $A \approx W H$. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data m… ▽ More Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors $W$ and $H$, for the given input matrix $A$, such that $A \approx W H$. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel software to solve the problem for big datasets. Existing distributed-memory algorithms are limited in terms of performance and applicability, as they are implemented using Hadoop and are designed only for sparse matrices. We propose a distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for $W$ and $H$. To our knowledge, our algorithm is the first high-performance parallel algorithm for NMF. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for dense and sparse matrices, and (2) it allows the user to choose from among multiple algorithms for solving local NLS subproblems within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. △ Less

Submitted 30 September, 2015; originally announced September 2015.

arXiv:1506.03872 [pdf, other]

doi 10.1109/ICDM.2015.46

Diamond Sampling for Approximate Maximum All-pairs Dot-product (MAD) Search

Authors: Grey Ballard, Ali Pinar, Tamara G. Kolda, C. Seshadhri

Abstract: Given two sets of vectors, $A = \{{a_1}, \dots, {a_m}\}$ and $B=\{{b_1},\dots,{b_n}\}$, our problem is to find the top-$t$ dot products, i.e., the largest $|{a_i}\cdot{b_j}|$ among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach t… ▽ More Given two sets of vectors, $A = \{{a_1}, \dots, {a_m}\}$ and $B=\{{b_1},\dots,{b_n}\}$, our problem is to find the top-$t$ dot products, i.e., the largest $|{a_i}\cdot{b_j}|$ among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all $mn$ dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of $A$ and $B$. The probability of selecting a diamond corresponding to pair $(i,j)$ is proportional to $({a_i}\cdot{b_j})^2$, amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods. △ Less

Submitted 18 June, 2015; v1 submitted 11 June, 2015; originally announced June 2015.

Journal ref: ICDM 2015: Proceedings of the 2015 IEEE International Conference on Data Mining, pp. 11-20, November 2015

arXiv:1409.2908 [pdf, ps, other]

doi 10.1145/2858788.2688513

A Framework for Practical Parallel Fast Matrix Multiplication

Authors: Austin R. Benson, Grey Ballard

Abstract: Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matr… ▽ More Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also the shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Furthermore, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical. △ Less

Submitted 9 September, 2014; originally announced September 2014.

ACM Class: G.4

Journal ref: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2015

arXiv:1209.2184 [pdf, other]

doi 10.1007/978-3-642-34862-4_2

Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication

Authors: Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Abstract: Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obta… ▽ More Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obtaining a new class of communication cost lower bounds. These apply, for example to the algorithms of Bini et al. (1979) and the algorithms of Hopcroft and Kerr (1971). Some of our bounds are proved to be optimal. △ Less

Submitted 10 September, 2012; originally announced September 2012.

Journal ref: Design and Analysis of Algorithms Volume 7659, 2012, pp 13-36

arXiv:1202.3177 [pdf, other]

Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds

Authors: Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Abstract: A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ba… ▽ More A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms. △ Less

Submitted 14 February, 2012; originally announced February 2012.

Comments: 4 pages, 1 figure

MSC Class: 68W10; 68W40 ACM Class: F.2.1

arXiv:1202.3173 [pdf, other]

Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication

Authors: Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Abstract: Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A criti… ▽ More Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA'11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of nodes ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms. △ Less

Submitted 14 February, 2012; originally announced February 2012.

Comments: 13 pages, 3 figures

MSC Class: 68W40; 68W10 ACM Class: F.2.1

arXiv:1109.1693 [pdf, ps, other]

doi 10.1145/1989493.1989495

Graph Expansion and Communication Costs of Fast Matrix Multiplication

Authors: Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Abstract: The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to… ▽ More The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, $Ω((\frac{n}{\sqrt M})^{ω_0}\cdot M)$, where $ω_0$ is the exponent in the arithmetic count (e.g., $ω_0 = \lg 7$ for Strassen, and $ω_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is $p$ times smaller. These bounds are attainable both for sequential and for parallel algorithms and hence optimal. These bounds can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation). △ Less

Submitted 8 September, 2011; originally announced September 2011.

Report number: UCB/EECS-2011-40 ACM Class: F.2.1

Journal ref: Proceedings of the 23rd annual symposium on parallelism in algorithms and architectures. ACM, 1-12. 2011 (a shorter conference version)

arXiv:1011.3077 [pdf, other]

Minimizing Communication for Eigenproblems and the Singular Value Decomposition

Authors: Grey Ballard, James Demmel, Ioana Dumitriu

Abstract: Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In \cite{BDHS10} lower bounds were presented on the amo… ▽ More Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In \cite{BDHS10} lower bounds were presented on the amount of communication required for essentially all $O(n^3)$-like algorithms for linear algebra, including eigenvalue problems and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform asymptotically more communication than these lower bounds require. In this paper we present parallel and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms that do attain these lower bounds, and analyze their convergence and communication costs. △ Less

Submitted 12 November, 2010; originally announced November 2010.

Comments: 43 pages, 11 figures

MSC Class: 65F15

arXiv:0905.2485 [pdf, ps, other]

doi 10.1137/090769156

Minimizing Communication in Linear Algebra

Authors: Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Abstract: In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense, matrix-multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $Ω$(#ar… ▽ More In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense, matrix-multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $Ω$(#arithmetic operations / $\sqrt{M}$), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. △ Less

Submitted 15 May, 2009; originally announced May 2009.

Comments: 27 pages, 2 tables

Journal ref: SIAM. J. Matrix Anal. & Appl. 32 (2011), no. 3, 866-901

arXiv:0902.2537 [pdf, other]

doi 10.1137/090760969

Communication-optimal Parallel and Sequential Cholesky Decomposition

Authors: Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Abstract: Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower… ▽ More Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [13, 14, 15], this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy. △ Less

Submitted 12 April, 2010; v1 submitted 15 February, 2009; originally announced February 2009.

Comments: 29 pages, 2 tables, 6 figures

ACM Class: F.2.1

Journal ref: SIAM J. Sci. Comput. 32, (2010) pp. 3495-3523

Showing 1–28 of 28 results for author: Ballard, G