-
Provably-Efficient and Internally-Deterministic Parallel Union-Find
Authors:
Alexander Fedorov,
Diba Hashemi,
Giorgi Nadiradze,
Dan Alistarh
Abstract:
Determining the degree of inherent parallelism in classical sequential algorithms and leveraging it for fast parallel execution is a key topic in parallel computing, and detailed analyses are known for a wide range of classical algorithms. In this paper, we perform the first such analysis for the fundamental Union-Find problem, in which we are given a graph as a sequence of edges, and must maintain its connectivity structure under edge additions. We prove that classic sequential algorithms for this problem are well-parallelizable under reasonable assumptions, addressing a conjecture by [Blelloch, 2017]. More precisely, we show via a new potential argument that, under uniform random edge ordering, parallel union-find operations are unlikely to interfere: $T$ concurrent threads processing the graph in parallel will encounter memory contention $O(T^2 \cdot \log |V| \cdot \log |E|)$ times in expectation, where $|E|$ and $|V|$ are the number of edges and nodes in the graph, respectively. We leverage this result to design a new parallel Union-Find algorithm that is internally deterministic, i.e., its results are guaranteed to match those of a sequential execution, and also work-efficient and scalable, as long as the number of threads $T$ is $O(|E|^{\frac{1}{3} - \varepsilon})$, for an arbitrarily small constant $\varepsilon > 0$, which holds for most large real-world graphs. We present lower bounds which show that our analysis is close to optimal, and experimental results suggesting that the performance cost of internal determinism is limited.
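For orientation, the sequential baseline that the abstract refers to can be sketched as follows. This is a minimal textbook union-find (path halving plus union by size), not the parallel algorithm from the paper; the names `UnionFind` and `count_components` are illustrative assumptions.

```python
from typing import Iterable, List, Tuple


class UnionFind:
    """Textbook sequential union-find (illustrative, not the paper's algorithm)."""

    def __init__(self, n: int) -> None:
        self.parent: List[int] = list(range(n))  # each node starts as its own root
        self.size: List[int] = [1] * n           # component size, valid at roots

    def find(self, x: int) -> int:
        # Path halving: redirect each visited node to its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        """Merge the components of a and b; return True if they were distinct."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def count_components(n: int, edges: Iterable[Tuple[int, int]]) -> int:
    """Process edges in the given order and return the number of components."""
    uf = UnionFind(n)
    components = n
    for u, v in edges:
        if uf.union(u, v):
            components -= 1
    return components
```

An internally deterministic parallel version would have to produce the same connectivity results as this loop run over the same edge sequence.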
Submitted 18 April, 2023;
originally announced April 2023.
-
Hybrid Decentralized Optimization: Leveraging Both First- and Zeroth-Order Optimizers for Faster Convergence
Authors:
Matin Ansaripour,
Shayan Talaei,
Giorgi Nadiradze,
Dan Alistarh
Abstract:
Distributed optimization is the standard way of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods. Yet, there are settings where some computationally-bounded nodes may not be able to implement first-order, gradient-based optimization, while they could still contribute to joint optimization tasks. In this paper, we initiate the study of hybrid decentralized optimization, studying settings where nodes with zeroth-order and first-order optimization capabilities co-exist in a distributed system, and attempt to jointly solve an optimization task over some data distribution. We essentially show that, under reasonable parameter settings, such a system can not only withstand noisier zeroth-order agents but can even benefit from integrating such agents into the optimization process, rather than ignoring their information. At the core of our approach is a new analysis of distributed optimization with noisy and possibly-biased gradient estimators, which may be of independent interest. Our results hold for both convex and non-convex objectives. Experimental results on standard optimization tasks confirm our analysis, showing that hybrid first-zeroth order optimization can be practical, even when training deep neural networks.
Submitted 4 September, 2024; v1 submitted 14 October, 2022;
originally announced October 2022.
-
Communication-Efficient Federated Learning With Data and Client Heterogeneity
Authors:
Hossein Zakerinia,
Shayan Talaei,
Giorgi Nadiradze,
Dan Alistarh
Abstract:
Federated Learning (FL) enables large-scale distributed training of machine learning models, while still allowing individual nodes to maintain data locally.
However, executing FL at scale comes with inherent practical challenges:
1) heterogeneity of the local node data distributions,
2) heterogeneity of node computational speeds (asynchrony),
but also 3) constraints in the amount of communication between the clients and the server.
In this work, we present the first variant of the classic federated averaging (FedAvg) algorithm
which, at the same time, supports data heterogeneity, partial client asynchrony, and communication compression.
Our algorithm comes with a rigorous analysis showing that, in spite of these system relaxations,
it can provide similar convergence to FedAvg in interesting parameter regimes.
Experimental results in the rigorous LEAF benchmark on setups of up to $300$ nodes show that our algorithm ensures fast convergence for standard federated tasks, improving upon prior quantized and asynchronous approaches.
Submitted 3 June, 2023; v1 submitted 20 June, 2022;
originally announced June 2022.
-
Multi-Queues Can Be State-of-the-Art Priority Schedulers
Authors:
Anastasiia Postnikova,
Nikita Koval,
Giorgi Nadiradze,
Dan Alistarh
Abstract:
Designing and implementing efficient parallel priority schedulers is an active research area. An intriguing proposed design is the Multi-Queue: given $n$ threads and $m\ge n$ distinct priority queues, task insertions are performed uniformly at random, while, to delete, a thread picks two queues uniformly at random, and removes the observed task of higher priority. This approach scales well, and has probabilistic rank guarantees: roughly, the rank of each task removed, relative to remaining tasks in all other queues, is $O(m)$ in expectation. Yet, the performance of this pattern is below that of well-engineered schedulers, which eschew theoretical guarantees for practical efficiency.
We investigate whether it is possible to design and implement a Multi-Queue-based task scheduler that is both highly efficient and has analytical guarantees. We propose a new variant called the Stealing Multi-Queue (SMQ), a cache-efficient variant of the Multi-Queue, which leverages both queue affinity -- each thread has a local queue, from which tasks are usually removed; but, with some probability, threads also attempt to steal higher-priority tasks from the other queues -- and task batching, that is, the processing of several tasks in a single insert / delete step. These ideas are well-known for task scheduling without priorities; our theoretical contribution is showing that, despite relaxations, this design can still provide rank guarantees, which in turn implies bounds on total work performed. We provide a general SMQ implementation that can surpass state-of-the-art schedulers such as Galois and PMOD in terms of performance on popular graph-processing benchmarks. Notably, the performance improvement comes mainly from the superior rank guarantees provided by our scheduler, confirming that analytically-reasoned approaches can still provide performance improvements for priority task scheduling.
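The basic Multi-Queue pattern described at the start of this abstract can be sketched sequentially as below. This is a single-threaded illustration only: it omits all concurrency control as well as the stealing and batching ideas of the SMQ, and it assumes that a smaller integer means higher priority. The class name `MultiQueue` and its methods are hypothetical.

```python
import heapq
import random
from typing import List, Optional


class MultiQueue:
    """Single-threaded sketch of the Multi-Queue pattern (assumed API)."""

    def __init__(self, m: int, seed: Optional[int] = None) -> None:
        self.queues: List[List[int]] = [[] for _ in range(m)]  # m binary heaps
        self.rng = random.Random(seed)

    def insert(self, priority: int) -> None:
        # Insert into a queue chosen uniformly at random.
        heapq.heappush(self.rng.choice(self.queues), priority)

    def delete(self) -> Optional[int]:
        # Sample two queues uniformly at random (possibly the same one)
        # and remove the higher-priority (smaller) of the two observed tops.
        a = self.rng.choice(self.queues)
        b = self.rng.choice(self.queues)
        candidates = [q for q in (a, b) if q]
        if not candidates:
            return None  # both sampled queues were empty; the caller may retry
        best = min(candidates, key=lambda q: q[0])
        return heapq.heappop(best)
```

Note that `delete` may return `None` while other queues still hold tasks, so a caller that needs to drain the structure has to retry.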
Submitted 1 September, 2021;
originally announced September 2021.
-
Lower Bounds for Shared-Memory Leader Election under Bounded Write Contention
Authors:
Dan Alistarh,
Rati Gelashvili,
Giorgi Nadiradze
Abstract:
This paper gives tight logarithmic lower bounds on the solo step complexity of leader election in an asynchronous shared-memory model with single-writer multi-reader (SWMR) registers, for randomized obstruction-free algorithms.
The approach extends to lower bounds for randomized obstruction-free algorithms using multi-writer registers under bounded write concurrency, showing a trade-off between the solo step complexity of a leader election algorithm, and the worst-case contention incurred by a processor in an execution.
Submitted 25 March, 2022; v1 submitted 5 August, 2021;
originally announced August 2021.
-
Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Authors:
Shigang Li,
Tal Ben-Nun,
Giorgi Nadiradze,
Salvatore Di Girolamo,
Nikoli Dryden,
Dan Alistarh,
Torsten Hoefler
Abstract:
Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
Submitted 20 February, 2021; v1 submitted 30 April, 2020;
originally announced May 2020.
-
Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers
Authors:
Dan Alistarh,
Nikita Koval,
Giorgi Nadiradze
Abstract:
Several classic problems in graph processing and computational geometry are solved via incremental algorithms, which split computation into a series of small tasks acting on shared state, which gets updated progressively.
While the sequential variant of such algorithms usually specifies a fixed (but sometimes random) order in which the tasks should be performed, a standard approach to parallelizing such algorithms is to relax this constraint to allow for out-of-order parallel execution. This is the case for parallel implementations of Dijkstra's single-source shortest-paths algorithm (SSSP), and for parallel Delaunay mesh triangulation.
While many software frameworks parallelize incremental computation in this way, it is still not well understood whether this relaxed ordering approach can still provide any complexity guarantees.
In this paper, we address this problem, and analyze the efficiency guarantees provided by a range of incremental algorithms when parallelized via relaxed schedulers.
We show that, for algorithms such as Delaunay mesh triangulation and sorting by insertion, schedulers with a maximum relaxation factor of $k$ in terms of the maximum priority inversion allowed will introduce a maximum amount of wasted work of $O(\log(n) \, \mathrm{poly}(k))$, where $n$ is the number of tasks to be executed.
For SSSP, we show that the additional work is $O(\mathrm{poly}(k) \, d_{\max} / w_{\min})$, where $d_{\max}$ is the maximum distance between two nodes, and $w_{\min}$ is the minimum such distance. In practical settings where $n \gg k$, this suggests that the overheads of relaxation will be outweighed by the improved scalability of the relaxed scheduler.
On the negative side, we provide lower bounds showing that certain algorithms will inherently incur a non-trivial amount of wasted work due to scheduler relaxation, even for relatively benign relaxed schedulers.
Submitted 22 March, 2020; v1 submitted 20 March, 2020;
originally announced March 2020.
-
Dynamic Averaging Load Balancing on Cycles
Authors:
Dan Alistarh,
Giorgi Nadiradze,
Amirmojtaba Sabour
Abstract:
We consider the following dynamic load-balancing process: given an underlying graph $G$ with $n$ nodes, in each step $t\geq 0$, one unit of load is created, and placed at a randomly chosen graph node. In the same step, the chosen node picks a random neighbor, and the two nodes balance their loads by averaging them. We are interested in the expected gap between the minimum and maximum loads at nodes as the process progresses, and its dependence on $n$ and on the graph structure.
Similar variants of the above graphical balanced allocation process have been studied by Peres, Talwar, and Wieder, and by Sauerwald and Sun for regular graphs. These authors left as open the question of characterizing the gap in the case of \emph{cycle graphs} in the \emph{dynamic} case, where weights are created during the algorithm's execution. For this case, the only known upper bound is of $\mathcal{O}( n \log n )$, following from a majorization argument due to Peres, Talwar, and Wieder, which analyzes a related graphical allocation process.
In this paper, we provide an upper bound of $\mathcal{O} ( \sqrt n \log n )$ on the expected gap of the above process for cycles of length $n$. We introduce a new potential analysis technique, which enables us to bound the difference in load between $k$-hop neighbors on the cycle, for any $k \leq n / 2$. We complement this with a "gap covering" argument, which bounds the maximum value of the gap by bounding its value across all possible subsets of a certain structure, and recursively bounding the gaps within each subset. We provide analytical and experimental evidence that our upper bound on the gap is tight up to a logarithmic factor.
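The process defined at the start of this abstract is easy to simulate on a cycle. The sketch below is an illustration under stated assumptions: nodes are indexed $0, \dots, n-1$, the two neighbors of node $i$ are $i \pm 1 \bmod n$, and the function names are hypothetical. It makes no claim about reproducing the paper's experiments.

```python
import random
from typing import List


def simulate_cycle_averaging(n: int, steps: int, seed: int = 0) -> List[float]:
    """Run the dynamic averaging process on a cycle of n nodes."""
    rng = random.Random(seed)
    load = [0.0] * n
    for _ in range(steps):
        i = rng.randrange(n)              # one unit of load arrives at a random node
        load[i] += 1.0
        j = (i + rng.choice((-1, 1))) % n  # that node picks a random cycle neighbor
        avg = (load[i] + load[j]) / 2.0    # the two nodes average their loads
        load[i] = avg
        load[j] = avg
    return load


def gap(load: List[float]) -> float:
    """Difference between the maximum and minimum load."""
    return max(load) - min(load)
```

Calling `gap(simulate_cycle_averaging(n, steps))` for growing `n` gives a quick empirical view of how the gap depends on the cycle length.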
Submitted 20 March, 2020;
originally announced March 2020.
-
Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent
Authors:
Giorgi Nadiradze,
Ilia Markov,
Bapi Chatterjee,
Vyacheslav Kungurtsev,
Dan Alistarh
Abstract:
Machine learning has made tremendous progress in recent years, with models matching or even surpassing humans on a series of specialized tasks. One key element behind the progress of machine learning in recent years has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Many of these models are trained employing variants of stochastic gradient descent (SGD) based optimization.
In this paper, we introduce a general consistency condition covering communication-reduced and asynchronous distributed SGD implementations. Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods used in practice to train large-scale machine learning models. The proposed framework de-clutters the implementation-specific convergence analysis and provides an abstraction to derive convergence bounds. We utilize the framework to analyze a sparsification scheme for distributed SGD methods in an asynchronous setting for convex and non-convex objectives. We implement the distributed SGD variant to train deep CNN models in an asynchronous shared-memory setting. Empirical results show that error-feedback may not necessarily help in improving the convergence of sparsified asynchronous distributed SGD, which corroborates an insight suggested by our convergence analysis.
Submitted 28 June, 2020; v1 submitted 16 January, 2020;
originally announced January 2020.
-
Asynchronous Decentralized SGD with Quantized and Local Updates
Authors:
Giorgi Nadiradze,
Amirmojtaba Sabour,
Peter Davies,
Shigang Li,
Dan Alistarh
Abstract:
Decentralized optimization is emerging as a viable alternative for scalable distributed machine learning, but also introduces new challenges in terms of synchronization costs. To this end, several communication-reduction techniques, such as non-blocking communication, quantization, and local steps, have been explored in the decentralized setting. Due to the complexity of analyzing optimization in such a relaxed setting, this line of work often assumes \emph{global} communication rounds, which require additional synchronization. In this paper, we consider decentralized optimization in the simpler, but harder to analyze, \emph{asynchronous gossip} model, in which communication occurs in discrete, randomly chosen pairings among nodes. Perhaps surprisingly, we show that a variant of SGD called \emph{SwarmSGD} still converges in this setting, even if \emph{non-blocking communication}, \emph{quantization}, and \emph{local steps} are all applied \emph{in conjunction}, and even if the node data distributions and underlying graph topology are both \emph{heterogeneous}. Our analysis is based on a new connection with multi-dimensional load-balancing processes. We implement this algorithm and deploy it in a super-computing environment, showing that it can outperform previous decentralized methods in terms of end-to-end training time, and that it can even rival carefully-tuned large-batch SGD for certain tasks.
Submitted 25 March, 2022; v1 submitted 27 October, 2019;
originally announced October 2019.
-
Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms
Authors:
Dan Alistarh,
Trevor Brown,
Justin Kopinsky,
Giorgi Nadiradze
Abstract:
There has been significant progress in understanding the parallelism inherent to iterative sequential algorithms: for many classic algorithms, the depth of the dependence structure is now well understood, and scheduling techniques have been developed to exploit this shallow dependence structure for efficient parallel implementations. A related, applied research strand has studied methods by which certain iterative task-based algorithms can be efficiently parallelized via relaxed concurrent priority schedulers. These allow for high concurrency when inserting and removing tasks, at the cost of executing superfluous work due to the relaxed semantics of the scheduler.
In this work, we take a step towards unifying these two research directions, by showing that there exists a family of relaxed priority schedulers that can efficiently and deterministically execute classic iterative algorithms such as greedy maximal independent set (MIS) and matching. Our primary result shows that, given a randomized scheduler with an expected relaxation factor of $k$ in terms of the maximum allowed priority inversions on a task, and any graph on $n$ vertices, the scheduler is able to execute greedy MIS with only an additive factor of poly($k$) expected additional iterations compared to an exact (but not scalable) scheduler. This counter-intuitive result demonstrates that the overhead of relaxation when computing MIS is not dependent on the input size or structure of the input graph. Experimental results show that this overhead can be clearly offset by the gain in performance due to the highly scalable scheduler. In sum, we present an efficient method to deterministically parallelize iterative sequential algorithms, with provable runtime guarantees in terms of the number of executed tasks to completion.
Submitted 13 August, 2018;
originally announced August 2018.
-
Distributionally Linearizable Data Structures
Authors:
Dan Alistarh,
Trevor Brown,
Justin Kopinsky,
Jerry Z. Li,
Giorgi Nadiradze
Abstract:
Relaxed concurrent data structures have become increasingly popular, due to their scalability in graph processing and machine learning applications. Despite considerable interest, there exist families of natural, high performing randomized relaxed concurrent data structures, such as the popular MultiQueue pattern for implementing relaxed priority queue data structures, for which no guarantees are known in the concurrent setting. Our main contribution is in showing for the first time that, under a set of analytic assumptions, a family of relaxed concurrent data structures, including variants of MultiQueues, but also a new approximate counting algorithm we call the MultiCounter, provides strong probabilistic guarantees on the degree of relaxation with respect to the sequential specification, in arbitrary concurrent executions. We formalize these guarantees via a new correctness condition called distributional linearizability, tailored to concurrent implementations with randomized relaxations. Our result is based on a new analysis of an asynchronous variant of the classic power-of-two-choices load balancing algorithm, in which placement choices can be based on inconsistent, outdated information (this result may be of independent interest). We validate our results empirically, showing that the MultiCounter algorithm can implement scalable relaxed timestamps, which in turn can improve the performance of the classic TL2 transactional algorithm by up to 3 times, for some settings of parameters.
Submitted 25 March, 2022; v1 submitted 3 April, 2018;
originally announced April 2018.
-
The Transactional Conflict Problem
Authors:
Dan Alistarh,
Syed Kamran Haider,
Raphael Kübler,
Giorgi Nadiradze
Abstract:
The transactional conflict problem arises in transactional systems whenever two or more concurrent transactions clash on a data item.
While the standard solution to such conflicts is to immediately abort one of the transactions, some practical systems consider the alternative of delaying conflict resolution for a short interval, which may allow one of the transactions to commit. The challenge in the transactional conflict problem is to choose the optimal length of this delay interval so as to minimize the overall running time penalty for the conflicting transactions. In this paper, we propose a family of optimal online algorithms for the transactional conflict problem.
Specifically, we consider variants of this problem which arise in different implementations of transactional systems, namely "requestor wins" and "requestor aborts" implementations: in the former, the recipient of a coherence request is aborted, whereas in the latter, it is the requestor which has to abort. Both strategies are implemented by real systems.
We show that the requestor aborts case can be reduced to a classic instance of the ski rental problem, while the requestor wins case leads to a new version of this classical problem, for which we derive optimal deterministic and randomized algorithms.
Moreover, we prove that, under a simplified adversarial model, our algorithms are constant-competitive with the offline optimum in terms of throughput.
We validate our algorithmic results empirically through a hardware simulation of hardware transactional memory (HTM), showing that our algorithms can lead to non-trivial performance improvements for classic concurrent data structures.
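As background for the ski rental reduction mentioned above, the classic deterministic break-even strategy can be sketched as follows. This is the textbook ski rental problem only, under the assumption that renting costs 1 per day and buying costs `buy_cost`; it is not the paper's algorithm, and the mapping from rent/buy decisions to transaction delays and aborts is left to the paper. The function names are hypothetical.

```python
def break_even_cost(buy_cost: int, days: int) -> int:
    """Cost of the break-even strategy: rent for buy_cost - 1 days, then buy.

    Assumes renting costs 1 per day and the season lasts `days` >= 1 days.
    """
    if days < buy_cost:
        return days                       # the season ended before we bought
    return (buy_cost - 1) + buy_cost      # rented buy_cost - 1 days, then bought


def offline_optimum(buy_cost: int, days: int) -> int:
    """Cost of the optimal decision when `days` is known in advance."""
    return min(days, buy_cost)
```

With these definitions the online cost never exceeds twice the offline optimum, which is the standard competitive guarantee for the deterministic strategy.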
Submitted 3 April, 2018;
originally announced April 2018.
-
The Power of Choice in Priority Scheduling
Authors:
Dan Alistarh,
Justin Kopinsky,
Jerry Li,
Giorgi Nadiradze
Abstract:
Consider the following random process: we are given $n$ queues, into which elements of increasing labels are inserted uniformly at random. To remove an element, we pick two queues at random, and remove the element of lower label (higher priority) among the two. The cost of a removal is the rank of the label removed, among labels still present in any of the queues, that is, the distance from the optimal choice at each step. Variants of this strategy are prevalent in state-of-the-art concurrent priority queue implementations. Nonetheless, it is not known whether such implementations provide any rank guarantees, even in a sequential model.
We answer this question, showing that this strategy provides surprisingly strong guarantees: Although the single-choice process, where we always insert and remove from a single randomly chosen queue, has degrading cost, going to infinity as we increase the number of steps, in the two choice process, the expected rank of a removed element is $O( n )$ while the expected worst-case cost is $O( n \log n )$. These bounds are tight, and hold irrespective of the number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded" balls-into-bins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation, which improves upon the state of the art in terms of practical performance.
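The rank cost of the process above can be measured with a small sequential simulation. The sketch below is illustrative: the `prefill` parameter (elements inserted before any removal), the alternation of one insertion and one removal per step, and the function name are assumptions rather than details from the paper. Setting `choices=1` gives the single-choice variant and `choices=2` the two-choice variant.

```python
import heapq
import random
from typing import List


def average_removal_rank(n: int, steps: int, choices: int,
                         prefill: int = 0, seed: int = 0) -> float:
    """Average rank of removed labels among labels still present (1 = optimal)."""
    rng = random.Random(seed)
    queues: List[List[int]] = [[] for _ in range(n)]
    next_label = 0

    def insert() -> None:
        nonlocal next_label
        # Labels are increasing; each goes to a uniformly random queue.
        heapq.heappush(queues[rng.randrange(n)], next_label)
        next_label += 1

    for _ in range(prefill):
        insert()

    total_rank = 0
    removals = 0
    for _ in range(steps):
        insert()
        picked = [queues[rng.randrange(n)] for _ in range(choices)]
        non_empty = [q for q in picked if q]
        if not non_empty:
            continue  # all sampled queues were empty; skip this removal
        best = min(non_empty, key=lambda q: q[0])
        label = heapq.heappop(best)
        # Rank = 1 + number of remaining labels smaller than the removed one.
        total_rank += 1 + sum(1 for q in queues for x in q if x < label)
        removals += 1
    return total_rank / removals if removals else 0.0
```

Comparing `average_removal_rank(n, steps, 1, prefill=...)` against the same call with `choices=2` over increasing `steps` is one way to observe the contrast between the two processes empirically.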
Submitted 13 June, 2017;
originally announced June 2017.