Search | arXiv e-print repository

FedAST: Federated Asynchronous Simultaneous Training

Authors: Baris Askin, Pranay Sharma, Carlee Joe-Wong, Gauri Joshi

Abstract: Federated Learning (FL) enables edge devices or clients to collaboratively train machine learning (ML) models without sharing their private data. Much of the existing work in FL focuses on efficiently learning a model for a single task. In this paper, we study simultaneous training of multiple FL models using a common set of clients. The few existing simultaneous training methods employ synchronou… ▽ More Federated Learning (FL) enables edge devices or clients to collaboratively train machine learning (ML) models without sharing their private data. Much of the existing work in FL focuses on efficiently learning a model for a single task. In this paper, we study simultaneous training of multiple FL models using a common set of clients. The few existing simultaneous training methods employ synchronous aggregation of client updates, which can cause significant delays because large models and/or slow clients can bottleneck the aggregation. On the other hand, a naive asynchronous aggregation is adversely affected by stale client updates. We propose FedAST, a buffered asynchronous federated simultaneous training algorithm that overcomes bottlenecks from slow models and adaptively allocates client resources across heterogeneous tasks. We provide theoretical convergence guarantees for FedAST for smooth non-convex objective functions. Extensive experiments over multiple real-world datasets demonstrate that our proposed method outperforms existing simultaneous FL approaches, achieving up to 46.0% reduction in time to train multiple tasks to completion. △ Less

Submitted 1 June, 2024; originally announced June 2024.

Comments: Accepted to UAI 2024

arXiv:2403.12329 [pdf, other]

FedFisher: Leveraging Fisher Information for One-Shot Federated Learning

Authors: Divyansh Jhunjhunwala, Shiqiang Wang, Gauri Joshi

Abstract: Standard federated learning (FL) algorithms typically require multiple rounds of communication between the server and the clients, which has several drawbacks, including requiring constant network connectivity, repeated investment of computational resources, and susceptibility to privacy attacks. One-Shot FL is a new paradigm that aims to address this challenge by enabling the server to train a gl… ▽ More Standard federated learning (FL) algorithms typically require multiple rounds of communication between the server and the clients, which has several drawbacks, including requiring constant network connectivity, repeated investment of computational resources, and susceptibility to privacy attacks. One-Shot FL is a new paradigm that aims to address this challenge by enabling the server to train a global model in a single round of communication. In this work, we present FedFisher, a novel algorithm for one-shot FL that makes use of Fisher information matrices computed on local client models, motivated by a Bayesian perspective of FL. First, we theoretically analyze FedFisher for two-layer over-parameterized ReLU neural networks and show that the error of our one-shot FedFisher global model becomes vanishingly small as the width of the neural networks and amount of local training at clients increases. Next, we propose practical variants of FedFisher using the diagonal Fisher and K-FAC approximation for the full Fisher and highlight their communication and compute efficiency for FL. Finally, we conduct extensive experiments on various datasets, which show that these variants of FedFisher consistently improve over competing baselines. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted to AISTATS 2024

arXiv:2402.05876 [pdf, other]

Federated Offline Reinforcement Learning: Collaborative Single-Policy Coverage Suffices

Authors: Jiin Woo, Laixi Shi, Gauri Joshi, Yuejie Chi

Abstract: Offline reinforcement learning (RL), which seeks to learn an optimal policy using offline data, has garnered significant interest due to its potential in critical applications where online data collection is infeasible or expensive. This work explores the benefit of federated learning for offline RL, aiming at collaboratively leveraging offline datasets at multiple agents. Focusing on finite-horiz… ▽ More Offline reinforcement learning (RL), which seeks to learn an optimal policy using offline data, has garnered significant interest due to its potential in critical applications where online data collection is infeasible or expensive. This work explores the benefit of federated learning for offline RL, aiming at collaboratively leveraging offline datasets at multiple agents. Focusing on finite-horizon episodic tabular Markov decision processes (MDPs), we design FedLCB-Q, a variant of the popular model-free Q-learning algorithm tailored for federated offline RL. FedLCB-Q updates local Q-functions at agents with novel learning rate schedules and aggregates them at a central server using importance averaging and a carefully designed pessimistic penalty term. Our sample complexity analysis reveals that, with appropriately chosen parameters and synchronization schedules, FedLCB-Q achieves linear speedup in terms of the number of agents without requiring high-quality datasets at individual agents, as long as the local datasets collectively cover the state-action space visited by the optimal policy, highlighting the power of collaboration in the federated setting. In fact, the sample complexity almost matches that of the single-agent counterpart, as if all the data are stored at a central location, up to polynomial factors of the horizon length. Furthermore, FedLCB-Q is communication-efficient, where the number of communication rounds is only linear with respect to the horizon length up to logarithmic factors. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2402.01147 [pdf, other]

Efficient Reinforcement Learning for Routing Jobs in Heterogeneous Queueing Systems

Authors: Neharika Jali, Guannan Qu, Weina Wang, Gauri Joshi

Abstract: We consider the problem of efficiently routing jobs that arrive into a central queue to a system of heterogeneous servers. Unlike homogeneous systems, a threshold policy, that routes jobs to the slow server(s) when the queue length exceeds a certain threshold, is known to be optimal for the one-fast-one-slow two-server system. But an optimal policy for the multi-server system is unknown and non-tr… ▽ More We consider the problem of efficiently routing jobs that arrive into a central queue to a system of heterogeneous servers. Unlike homogeneous systems, a threshold policy, that routes jobs to the slow server(s) when the queue length exceeds a certain threshold, is known to be optimal for the one-fast-one-slow two-server system. But an optimal policy for the multi-server system is unknown and non-trivial to find. While Reinforcement Learning (RL) has been recognized to have great potential for learning policies in such cases, our problem has an exponentially large state space size, rendering standard RL inefficient. In this work, we propose ACHQ, an efficient policy gradient based algorithm with a low dimensional soft threshold policy parameterization that leverages the underlying queueing structure. We provide stationary-point convergence guarantees for the general case and despite the low-dimensional parameterization prove that ACHQ converges to an approximate global optimum for the special case of two servers. Simulations demonstrate an improvement in expected response time of up to ~30% over the greedy policy that routes to the fastest available server. △ Less

Submitted 21 April, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: AISTATS 2024; Corrected typos

arXiv:2401.06432 [pdf, other]

Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models

Authors: Yae Jee Cho, Luyang Liu, Zheng Xu, Aldi Fahrezi, Gauri Joshi

Abstract: Foundation models (FMs) adapt well to specific domains or tasks with fine-tuning, and federated learning (FL) enables the potential for privacy-preserving fine-tuning of the FMs with on-device local data. For federated fine-tuning of FMs, we consider the FMs with small to medium parameter sizes of single digit billion at maximum, referred to as on-device FMs (ODFMs) that can be deployed on devices… ▽ More Foundation models (FMs) adapt well to specific domains or tasks with fine-tuning, and federated learning (FL) enables the potential for privacy-preserving fine-tuning of the FMs with on-device local data. For federated fine-tuning of FMs, we consider the FMs with small to medium parameter sizes of single digit billion at maximum, referred to as on-device FMs (ODFMs) that can be deployed on devices for inference but can only be fine-tuned with parameter efficient methods. In our work, we tackle the data and system heterogeneity problem of federated fine-tuning of ODFMs by proposing a novel method using heterogeneous low-rank approximations (LoRAs), namely HetLoRA. First, we show that the naive approach of using homogeneous LoRA ranks across devices face a trade-off between overfitting and slow convergence, and thus propose HetLoRA, which allows heterogeneous ranks across client devices and efficiently aggregates and distributes these heterogeneous LoRA modules. By applying rank self-pruning locally and sparsity-weighted aggregation at the server, HetLoRA combines the advantages of high and low-rank LoRAs, which achieves improved convergence speed and final performance compared to homogeneous LoRA. Furthermore, HetLoRA offers enhanced computation efficiency compared to full fine-tuning, making it suitable for federated fine-tuning across heterogeneous devices. △ Less

Submitted 20 February, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

arXiv:2310.18868 [pdf, other]

Correlation Aware Sparsified Mean Estimation Using Random Projection

Authors: Shuli Jiang, Pranay Sharma, Gauri Joshi

Abstract: We study the problem of communication-efficient distributed vector mean estimation, a commonly used subroutine in distributed optimization and Federated Learning (FL). Rand-$k$ sparsification is a commonly used technique to reduce communication cost, where each client sends $k < d$ of its coordinates to the server. However, Rand-$k$ is agnostic to any correlations, that might exist between clients… ▽ More We study the problem of communication-efficient distributed vector mean estimation, a commonly used subroutine in distributed optimization and Federated Learning (FL). Rand-$k$ sparsification is a commonly used technique to reduce communication cost, where each client sends $k < d$ of its coordinates to the server. However, Rand-$k$ is agnostic to any correlations, that might exist between clients in practical scenarios. The recently proposed Rand-$k$-Spatial estimator leverages the cross-client correlation information at the server to improve Rand-$k$'s performance. Yet, the performance of Rand-$k$-Spatial is suboptimal. We propose the Rand-Proj-Spatial estimator with a more flexible encoding-decoding procedure, which generalizes the encoding of Rand-$k$ by projecting the client vectors to a random $k$-dimensional subspace. We utilize Subsampled Randomized Hadamard Transform (SRHT) as the projection matrix and show that Rand-Proj-Spatial with SRHT outperforms Rand-$k$-Spatial, using the correlation information more efficiently. Furthermore, we propose an approach to incorporate varying degrees of correlation and suggest a practical variant of Rand-Proj-Spatial when the correlation information is not available to the server. Experiments on real-world distributed optimization tasks showcase the superior performance of Rand-Proj-Spatial compared to Rand-$k$-Spatial and other more sophisticated sparsification techniques. △ Less

Submitted 28 October, 2023; originally announced October 2023.

Comments: 32 pages, 13 figures. Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, USA

arXiv:2310.18784 [pdf, other]

High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise

Authors: Aleksandar Armacki, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, Soummya Kar

Abstract: We study high-probability convergence guarantees of learning on streaming data in the presence of heavy-tailed noise. In the proposed scenario, the model is updated in an online fashion, as new information is observed, without storing any additional data. To combat the heavy-tailed noise, we consider a general framework of nonlinear stochastic gradient descent (SGD), providing several strong resul… ▽ More We study high-probability convergence guarantees of learning on streaming data in the presence of heavy-tailed noise. In the proposed scenario, the model is updated in an online fashion, as new information is observed, without storing any additional data. To combat the heavy-tailed noise, we consider a general framework of nonlinear stochastic gradient descent (SGD), providing several strong results. First, for non-convex costs and component-wise nonlinearities, we establish a convergence rate arbitrarily close to $\mathcal{O}\left(t^{-\frac{1}{4}}\right)$, whose exponent is independent of noise and problem parameters. Second, for strongly convex costs and component-wise nonlinearities, we establish a rate arbitrarily close to $\mathcal{O}\left(t^{-\frac{1}{2}}\right)$ for the weighted average of iterates, with exponent again independent of noise and problem parameters. Finally, for strongly convex costs and a broader class of nonlinearities, we establish convergence of the last iterate, with a rate $\mathcal{O}\left(t^{-ζ} \right)$, where $ζ\in (0,1)$ depends on problem parameters, noise and nonlinearity. As we show analytically and numerically, $ζ$ can be used to inform the preferred choice of nonlinearity for given problem settings. Compared to state-of-the-art, who only consider clipping, require bounded noise moments of order $η\in (1,2]$, and establish convergence rates whose exponents go to zero as $η\rightarrow 1$, we provide high-probability guarantees for a much broader class of nonlinearities and symmetric density noise, with convergence rates whose exponents are bounded away from zero, even when the noise has finite first moment only. Moreover, in the case of strongly convex functions, we demonstrate analytically and numerically that clipping is not always the optimal nonlinearity, further underlining the value of our general framework. △ Less

Submitted 30 April, 2024; v1 submitted 28 October, 2023; originally announced October 2023.

Comments: 30 pages, 3 figures

arXiv:2307.08809 [pdf, other]

Local or Global: Selective Knowledge Assimilation for Federated Learning with Limited Labels

Authors: Yae Jee Cho, Gauri Joshi, Dimitrios Dimitriadis

Abstract: Many existing FL methods assume clients with fully-labeled data, while in realistic settings, clients have limited labels due to the expensive and laborious process of labeling. Limited labeled local data of the clients often leads to their local model having poor generalization abilities to their larger unlabeled local data, such as having class-distribution mismatch with the unlabeled data. As a… ▽ More Many existing FL methods assume clients with fully-labeled data, while in realistic settings, clients have limited labels due to the expensive and laborious process of labeling. Limited labeled local data of the clients often leads to their local model having poor generalization abilities to their larger unlabeled local data, such as having class-distribution mismatch with the unlabeled data. As a result, clients may instead look to benefit from the global model trained across clients to leverage their unlabeled data, but this also becomes difficult due to data heterogeneity across clients. In our work, we propose FedLabel where clients selectively choose the local or global model to pseudo-label their unlabeled data depending on which is more of an expert of the data. We further utilize both the local and global models' knowledge via global-local consistency regularization which minimizes the divergence between the two models' outputs when they have identical pseudo-labels for the unlabeled data. Unlike other semi-supervised FL baselines, our method does not require additional experts other than the local or global model, nor require additional parameters to be communicated. We also do not assume any server-labeled data or fully labeled clients. For both cross-device and cross-silo settings, we show that FedLabel outperforms other semi-supervised FL baselines by $8$-$24\%$, and even outperforms standard fully supervised FL baselines ($100\%$ labeled data) with only $5$-$20\%$ of labeled data. △ Less

Submitted 17 July, 2023; originally announced July 2023.

Comments: To appear in the proceedings of ICCV 2023

arXiv:2305.10697 [pdf, other]

The Blessing of Heterogeneity in Federated Q-Learning: Linear Speedup and Beyond

Authors: Jiin Woo, Gauri Joshi, Yuejie Chi

Abstract: When the data used for reinforcement learning (RL) are collected by multiple agents in a distributed manner, federated versions of RL algorithms allow collaborative learning without the need for agents to share their local data. In this paper, we consider federated Q-learning, which aims to learn an optimal Q-function by periodically aggregating local Q-estimates trained on local data alone. Focus… ▽ More When the data used for reinforcement learning (RL) are collected by multiple agents in a distributed manner, federated versions of RL algorithms allow collaborative learning without the need for agents to share their local data. In this paper, we consider federated Q-learning, which aims to learn an optimal Q-function by periodically aggregating local Q-estimates trained on local data alone. Focusing on infinite-horizon tabular Markov decision processes, we provide sample complexity guarantees for both the synchronous and asynchronous variants of federated Q-learning. In both cases, our bounds exhibit a linear speedup with respect to the number of agents and near-optimal dependencies on other salient problem parameters. In the asynchronous setting, existing analyses of federated Q-learning, which adopt an equally weighted averaging of local Q-estimates, require that every agent covers the entire state-action space. In contrast, our improved sample complexity scales inverse proportionally to the minimum entry of the average stationary state-action occupancy distribution of all agents, thus only requiring the agents to collectively cover the entire state-action space, unveiling the blessing of heterogeneity in enabling collaborative learning by relaxing the coverage requirement of the single-agent case. However, its sample complexity still suffers when the local trajectories are highly heterogeneous. In response, we propose a novel federated Q-learning algorithm with importance averaging, giving larger weights to more frequently visited state-action pairs, which achieves a robust linear speedup as if all trajectories are centrally processed, regardless of the heterogeneity of local behavior policies. △ Less

Submitted 12 December, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Short version at ICML 2023

arXiv:2302.04249 [pdf, other]

Federated Minimax Optimization with Client Heterogeneity

Authors: Pranay Sharma, Rohan Panda, Gauri Joshi

Abstract: Minimax optimization has seen a surge in interest with the advent of modern applications such as GANs, and it is inherently more challenging than simple minimization. The difficulty is exacerbated by the training data residing at multiple edge devices or \textit{clients}, especially when these clients can have heterogeneous datasets and local computation capabilities. We propose a general federate… ▽ More Minimax optimization has seen a surge in interest with the advent of modern applications such as GANs, and it is inherently more challenging than simple minimization. The difficulty is exacerbated by the training data residing at multiple edge devices or \textit{clients}, especially when these clients can have heterogeneous datasets and local computation capabilities. We propose a general federated minimax optimization framework that subsumes such settings and several existing methods like Local SGDA. We show that naive aggregation of heterogeneous local progress results in optimizing a mismatched objective function -- a phenomenon previously observed in standard federated minimization. To fix this problem, we propose normalizing the client updates by the number of local steps undertaken between successive communication rounds. We analyze the convergence of the proposed algorithm for classes of nonconvex-concave and nonconvex-nonconcave functions and characterize the impact of heterogeneous client data, partial client participation, and heterogeneous local computations. Our analysis works under more general assumptions on the intra-client noise and inter-client heterogeneity than so far considered in the literature. For all the function classes considered, we significantly improve the existing computation and communication complexity results. Experimental results support our theoretical claims. △ Less

Submitted 9 February, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

Comments: 52 pages, 8 figures

arXiv:2302.03109 [pdf, other]

On the Convergence of Federated Averaging with Cyclic Client Participation

Authors: Yae Jee Cho, Pranay Sharma, Gauri Joshi, Zheng Xu, Satyen Kale, Tong Zhang

Abstract: Federated Averaging (FedAvg) and its variants are the most popular optimization algorithms in federated learning (FL). Previous convergence analyses of FedAvg either assume full client participation or partial client participation where the clients can be uniformly sampled. However, in practical cross-device FL systems, only a subset of clients that satisfy local criteria such as battery status, n… ▽ More Federated Averaging (FedAvg) and its variants are the most popular optimization algorithms in federated learning (FL). Previous convergence analyses of FedAvg either assume full client participation or partial client participation where the clients can be uniformly sampled. However, in practical cross-device FL systems, only a subset of clients that satisfy local criteria such as battery status, network connectivity, and maximum participation frequency requirements (to ensure privacy) are available for training at a given time. As a result, client availability follows a natural cyclic pattern. We provide (to our knowledge) the first theoretical framework to analyze the convergence of FedAvg with cyclic client participation with several different client optimizers such as GD, SGD, and shuffled SGD. Our analysis discovers that cyclic client participation can achieve a faster asymptotic convergence rate than vanilla FedAvg with uniform client participation under suitable conditions, providing valuable insights into the design of client sampling protocols. △ Less

Submitted 6 February, 2023; originally announced February 2023.

arXiv:2301.09604 [pdf, other]

FedExP: Speeding Up Federated Averaging via Extrapolation

Authors: Divyansh Jhunjhunwala, Shiqiang Wang, Gauri Joshi

Abstract: Federated Averaging (FedAvg) remains the most popular algorithm for Federated Learning (FL) optimization due to its simple implementation, stateless nature, and privacy guarantees combined with secure aggregation. Recent work has sought to generalize the vanilla averaging in FedAvg to a generalized gradient descent step by treating client updates as pseudo-gradients and using a server step size. W… ▽ More Federated Averaging (FedAvg) remains the most popular algorithm for Federated Learning (FL) optimization due to its simple implementation, stateless nature, and privacy guarantees combined with secure aggregation. Recent work has sought to generalize the vanilla averaging in FedAvg to a generalized gradient descent step by treating client updates as pseudo-gradients and using a server step size. While the use of a server step size has been shown to provide performance improvement theoretically, the practical benefit of the server step size has not been seen in most existing works. In this work, we present FedExP, a method to adaptively determine the server step size in FL based on dynamically varying pseudo-gradients throughout the FL process. We begin by considering the overparameterized convex regime, where we reveal an interesting similarity between FedAvg and the Projection Onto Convex Sets (POCS) algorithm. We then show how FedExP can be motivated as a novel extension to the extrapolation mechanism that is used to speed up POCS. Our theoretical analysis later also discusses the implications of FedExP in underparameterized and non-convex settings. Experimental results show that FedExP consistently converges faster than FedAvg and competing baselines on a range of realistic FL datasets. △ Less

Submitted 6 March, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

Comments: Accepted to ICLR 2023. V2 fixes minor typos and cleans up proofs

arXiv:2207.14130 [pdf, other]

FedVARP: Tackling the Variance Due to Partial Client Participation in Federated Learning

Authors: Divyansh Jhunjhunwala, Pranay Sharma, Aushim Nagarkatti, Gauri Joshi

Abstract: Data-heterogeneous federated learning (FL) systems suffer from two significant sources of convergence error: 1) client drift error caused by performing multiple local optimization steps at clients, and 2) partial client participation error caused by the fact that only a small subset of the edge clients participate in every training round. We find that among these, only the former has received sign… ▽ More Data-heterogeneous federated learning (FL) systems suffer from two significant sources of convergence error: 1) client drift error caused by performing multiple local optimization steps at clients, and 2) partial client participation error caused by the fact that only a small subset of the edge clients participate in every training round. We find that among these, only the former has received significant attention in the literature. To remedy this, we propose FedVARP, a novel variance reduction algorithm applied at the server that eliminates error due to partial client participation. To do so, the server simply maintains in memory the most recent update for each client and uses these as surrogate updates for the non-participating clients in every round. Further, to alleviate the memory requirement at the server, we propose a novel clustering-based variance reduction algorithm ClusterFedVARP. Unlike previously proposed methods, both FedVARP and ClusterFedVARP do not require additional computation at clients or communication of additional optimization parameters. Through extensive experiments, we show that FedVARP outperforms state-of-the-art methods, and ClusterFedVARP achieves performance comparable to FedVARP with much less memory requirements. △ Less

Submitted 28 July, 2022; originally announced July 2022.

Comments: Accepted to UAI 2022

arXiv:2207.10992 [pdf, other]

Taguchi based Design of Sequential Convolution Neural Network for Classification of Defective Fasteners

Authors: Manjeet Kaur, Krishan Kumar Chauhan, Tanya Aggarwal, Pushkar Bharadwaj, Renu Vig, Isibor Kennedy Ihianle, Garima Joshi, Kayode Owa

Abstract: Fasteners play a critical role in securing various parts of machinery. Deformations such as dents, cracks, and scratches on the surface of fasteners are caused by material properties and incorrect handling of equipment during production processes. As a result, quality control is required to ensure safe and reliable operations. The existing defect inspection method relies on manual examination, whi… ▽ More Fasteners play a critical role in securing various parts of machinery. Deformations such as dents, cracks, and scratches on the surface of fasteners are caused by material properties and incorrect handling of equipment during production processes. As a result, quality control is required to ensure safe and reliable operations. The existing defect inspection method relies on manual examination, which consumes a significant amount of time, money, and other resources; also, accuracy cannot be guaranteed due to human error. Automatic defect detection systems have proven impactful over the manual inspection technique for defect analysis. However, computational techniques such as convolutional neural networks (CNN) and deep learning-based approaches are evolutionary methods. By carefully selecting the design parameter values, the full potential of CNN can be realised. Using Taguchi-based design of experiments and analysis, an attempt has been made to develop a robust automatic system in this study. The dataset used to train the system has been created manually for M14 size nuts having two labeled classes: Defective and Non-defective. There are a total of 264 images in the dataset. The proposed sequential CNN comes up with a 96.3% validation accuracy, 0.277 validation loss at 0.001 learning rate. △ Less

Submitted 22 July, 2022; originally announced July 2022.

Comments: 13 pages, 6 figures

arXiv:2207.04330 [pdf, ps, other]

Multi-Model Federated Learning with Provable Guarantees

Authors: Neelkamal Bhuyan, Sharayu Moharir, Gauri Joshi

Abstract: Federated Learning (FL) is a variant of distributed learning where edge devices collaborate to learn a model without sharing their data with the central server or each other. We refer to the process of training multiple independent models simultaneously in a federated setting using a common pool of clients as multi-model FL. In this work, we propose two variants of the popular FedAvg algorithm for… ▽ More Federated Learning (FL) is a variant of distributed learning where edge devices collaborate to learn a model without sharing their data with the central server or each other. We refer to the process of training multiple independent models simultaneously in a federated setting using a common pool of clients as multi-model FL. In this work, we propose two variants of the popular FedAvg algorithm for multi-model FL, with provable convergence guarantees. We further show that for the same amount of computation, multi-model FL can have better performance than training each model separately. We supplement our theoretical results with experiments in strongly convex, convex, and non-convex settings. △ Less

Submitted 20 September, 2022; v1 submitted 9 July, 2022; originally announced July 2022.

arXiv:2207.03983 [pdf, other]

Tackling Heterogeneous Traffic in Multi-access Systems via Erasure Coded Servers

Authors: Tuhinangshu Choudhury, Weina Wang, Gauri Joshi

Abstract: Most data generated by modern applications is stored in the cloud, and there is an exponential growth in the volume of jobs to access these data and perform computations using them. The volume of data access or computing jobs can be heterogeneous across different job types and can unpredictably change over time. Cloud service providers cope with this demand heterogeneity and unpredictability by ov… ▽ More Most data generated by modern applications is stored in the cloud, and there is an exponential growth in the volume of jobs to access these data and perform computations using them. The volume of data access or computing jobs can be heterogeneous across different job types and can unpredictably change over time. Cloud service providers cope with this demand heterogeneity and unpredictability by over-provisioning the number of servers hosting each job type. In this paper, we propose the addition of erasure-coded servers that can flexibly serve multiple job types without additional storage cost. We analyze the service capacity region and the response time of such erasure-coded systems and compare them with standard uncoded replication-based systems currently used in the cloud. We show that coding expands the service capacity region, thus enabling the system to handle variability in demand for different data types. Moreover, we characterize the response time of the coded system in various arrival rate regimes. This analysis reveals that adding even a small number of coded servers can significantly reduce the mean response time, with a drastic reduction in regimes where the demand is skewed across different job types. △ Less

Submitted 14 August, 2022; v1 submitted 8 July, 2022; originally announced July 2022.

arXiv:2206.10185 [pdf, other]

Federated Reinforcement Learning: Linear Speedup Under Markovian Sampling

Authors: Sajad Khodadadian, Pranay Sharma, Gauri Joshi, Siva Theja Maguluri

Abstract: Since reinforcement learning algorithms are notoriously data-intensive, the task of sampling observations from the environment is usually split across multiple agents. However, transferring these observations from the agents to a central location can be prohibitively expensive in terms of the communication cost, and it can also compromise the privacy of each agent's local behavior policy. In this… ▽ More Since reinforcement learning algorithms are notoriously data-intensive, the task of sampling observations from the environment is usually split across multiple agents. However, transferring these observations from the agents to a central location can be prohibitively expensive in terms of the communication cost, and it can also compromise the privacy of each agent's local behavior policy. In this paper, we consider a federated reinforcement learning framework where multiple agents collaboratively learn a global model, without sharing their individual data and policies. Each agent maintains a local copy of the model and updates it using locally sampled data. Although having N agents enables the sampling of N times more data, it is not clear if it leads to proportional convergence speedup. We propose federated versions of on-policy TD, off-policy TD and Q-learning, and analyze their convergence. For all these algorithms, to the best of our knowledge, we are the first to consider Markovian noise and multiple local updates, and prove a linear convergence speedup with respect to the number of agents. To obtain these results, we show that federated TD and Q-learning are special cases of a general framework for federated stochastic approximation with Markovian noise, and we leverage this framework to provide a unified convergence analysis that applies to all the algorithms. △ Less

Submitted 21 June, 2022; originally announced June 2022.

Comments: 69 pages, 1 figure, accepted to ICML 2022 for long presentation

arXiv:2206.04723 [pdf, other]

On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data

Authors: Jianyu Wang, Rudrajit Das, Gauri Joshi, Satyen Kale, Zheng Xu, Tong Zhang

Abstract: Existing theory predicts that data heterogeneity will degrade the performance of the Federated Averaging (FedAvg) algorithm in federated learning. However, in practice, the simple FedAvg algorithm converges very well. This paper explains the seemingly unreasonable effectiveness of FedAvg that contradicts the previous theoretical predictions. We find that the key assumption of bounded gradient diss… ▽ More Existing theory predicts that data heterogeneity will degrade the performance of the Federated Averaging (FedAvg) algorithm in federated learning. However, in practice, the simple FedAvg algorithm converges very well. This paper explains the seemingly unreasonable effectiveness of FedAvg that contradicts the previous theoretical predictions. We find that the key assumption of bounded gradient dissimilarity in previous theoretical analyses is too pessimistic to characterize data heterogeneity in practical applications. For a simple quadratic problem, we demonstrate there exist regimes where large gradient dissimilarity does not have any negative impact on the convergence of FedAvg. Motivated by this observation, we propose a new quantity, average drift at optimum, to measure the effects of data heterogeneity, and explicitly use it to present a new theoretical analysis of FedAvg. We show that the average drift at optimum is nearly zero across many real-world federated training tasks, whereas the gradient dissimilarity can be large. And our new analysis suggests FedAvg can have identical convergence rates in homogeneous and heterogeneous data settings, and hence, leads to better understanding of its empirical success. △ Less

Submitted 9 June, 2022; originally announced June 2022.

arXiv:2206.00799 [pdf, other]

Federated Learning under Distributed Concept Drift

Authors: Ellango Jothimurugesan, Kevin Hsieh, Jianyu Wang, Gauri Joshi, Phillip B. Gibbons

Abstract: Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). To the best of our knowledge, this work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solu… ▽ More Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). To the best of our knowledge, this work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solutions to drift adaptation that use a single global model are ill-suited to staggered drifts, necessitating multiple-model solutions. We identify the problem of drift adaptation as a time-varying clustering problem, and we propose two new clustering algorithms for reacting to drifts based on local drift detection and hierarchical clustering. Empirical evaluation shows that our solutions achieve significantly higher accuracy than existing baselines, and are comparable to an idealized algorithm with oracle knowledge of the ground-truth clustering of clients to concepts at each time step. △ Less

Submitted 27 February, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

Comments: 20 pages. Published in AISTATS 2023

ACM Class: I.2.6

arXiv:2205.14840 [pdf, other]

Maximizing Global Model Appeal in Federated Learning

Authors: Yae Jee Cho, Divyansh Jhunjhunwala, Tian Li, Virginia Smith, Gauri Joshi

Abstract: Federated learning typically considers collaboratively training a global model using local data at edge clients. Clients may have their own individual requirements, such as having a minimal training loss threshold, which they expect to be met by the global model. However, due to client heterogeneity, the global model may not meet each client's requirements, and only a small subset may find the glo… ▽ More Federated learning typically considers collaboratively training a global model using local data at edge clients. Clients may have their own individual requirements, such as having a minimal training loss threshold, which they expect to be met by the global model. However, due to client heterogeneity, the global model may not meet each client's requirements, and only a small subset may find the global model appealing. In this work, we explore the problem of the global model lacking appeal to the clients due to not being able to satisfy local requirements. We propose MaxFL, which aims to maximize the number of clients that find the global model appealing. We show that having a high global model appeal is important to maintain an adequate pool of clients for training, and can directly improve the test accuracy on both seen and unseen clients. We provide convergence guarantees for MaxFL and show that MaxFL achieves a $22$-$40\%$ and $18$-$50\%$ test accuracy improvement for the training clients and unseen clients respectively, compared to a wide range of FL modeling approaches, including those that tackle data heterogeneity, aim to incentivize clients, and learn personalized or fair models. △ Less

Submitted 4 February, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

arXiv:2204.12703 [pdf, other]

Heterogeneous Ensemble Knowledge Transfer for Training Large Models in Federated Learning

Authors: Yae Jee Cho, Andre Manoel, Gauri Joshi, Robert Sim, Dimitrios Dimitriadis

Abstract: Federated learning (FL) enables edge-devices to collaboratively learn a model without disclosing their private data to a central aggregating server. Most existing FL algorithms require models of identical architecture to be deployed across the clients and server, making it infeasible to train large models due to clients' limited system resources. In this work, we propose a novel ensemble knowledge… ▽ More Federated learning (FL) enables edge-devices to collaboratively learn a model without disclosing their private data to a central aggregating server. Most existing FL algorithms require models of identical architecture to be deployed across the clients and server, making it infeasible to train large models due to clients' limited system resources. In this work, we propose a novel ensemble knowledge transfer method named Fed-ET in which small models (different in architecture) are trained on clients, and used to train a larger model at the server. Unlike in conventional ensemble learning, in FL the ensemble can be trained on clients' highly heterogeneous data. Cognizant of this property, Fed-ET uses a weighted consensus distillation scheme with diversity regularization that efficiently extracts reliable consensus from the ensemble while improving generalization by exploiting the diversity within the ensemble. We show the generalization bound for the ensemble of weighted models trained on heterogeneous datasets that supports the intuition of Fed-ET. Our experiments on image and language tasks show that Fed-ET significantly outperforms other state-of-the-art FL algorithms with fewer communicated parameters, and is also robust against high data-heterogeneity. △ Less

Submitted 27 April, 2022; originally announced April 2022.

Comments: To appear in the proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI 2022)

arXiv:2203.11724 [pdf]

Explainable Misinformation Detection Across Multiple Social Media Platforms

Authors: Gargi Joshi, Ananya Srivastava, Bhargav Yagnik, Mohammed Hasan, Zainuddin Saiyed, Lubna A Gabralla, Ajith Abraham, Rahee Walambe, Ketan Kotecha

Abstract: In this work, the integration of two machine learning approaches, namely domain adaptation and explainable AI, is proposed to address these two issues of generalized detection and explainability. Firstly the Domain Adversarial Neural Network (DANN) develops a generalized misinformation detector across multiple social media platforms DANN is employed to generate the classification results for test… ▽ More In this work, the integration of two machine learning approaches, namely domain adaptation and explainable AI, is proposed to address these two issues of generalized detection and explainability. Firstly the Domain Adversarial Neural Network (DANN) develops a generalized misinformation detector across multiple social media platforms DANN is employed to generate the classification results for test domains with relevant but unseen data. The DANN-based model, a traditional black-box model, cannot justify its outcome, i.e., the labels for the target domain. Hence a Local Interpretable Model-Agnostic Explanations (LIME) explainable AI model is applied to explain the outcome of the DANN mode. To demonstrate these two approaches and their integration for effective explainable generalized detection, COVID-19 misinformation is considered a case study. We experimented with two datasets, namely CoAID and MiSoVac, and compared results with and without DANN implementation. DANN significantly improves the accuracy measure F1 classification score and increases the accuracy and AUC performance. The results obtained show that the proposed framework performs well in the case of domain shift and can learn domain-invariant features while explaining the target labels with LIME implementation enabling trustworthy information processing and extraction to combat misinformation effectively. △ Less

Submitted 19 September, 2022; v1 submitted 20 March, 2022; originally announced March 2022.

Comments: 28 pages,4 figures

arXiv:2203.04850 [pdf, other]

Federated Minimax Optimization: Improved Convergence Analyses and Algorithms

Authors: Pranay Sharma, Rohan Panda, Gauri Joshi, Pramod K. Varshney

Abstract: In this paper, we consider nonconvex minimax optimization, which is gaining prominence in many modern machine learning applications such as GANs. Large-scale edge-based collection of training data in these applications calls for communication-efficient distributed optimization algorithms, such as those used in federated learning, to process the data. In this paper, we analyze Local stochastic grad… ▽ More In this paper, we consider nonconvex minimax optimization, which is gaining prominence in many modern machine learning applications such as GANs. Large-scale edge-based collection of training data in these applications calls for communication-efficient distributed optimization algorithms, such as those used in federated learning, to process the data. In this paper, we analyze Local stochastic gradient descent ascent (SGDA), the local-update version of the SGDA algorithm. SGDA is the core algorithm used in minimax optimization, but it is not well-understood in a distributed setting. We prove that Local SGDA has \textit{order-optimal} sample complexity for several classes of nonconvex-concave and nonconvex-nonconcave minimax problems, and also enjoys \textit{linear speedup} with respect to the number of clients. We provide a novel and tighter analysis, which improves the convergence and communication guarantees in the existing literature. For nonconvex-PL and nonconvex-one-point-concave functions, we improve the existing complexity results for centralized minimax problems. Furthermore, we propose a momentum-based local-update algorithm, which has the same convergence guarantees, but outperforms Local SGDA as demonstrated in our experiments. △ Less

Submitted 9 March, 2022; originally announced March 2022.

Comments: 52 pages, 4 figures

arXiv:2201.11865 [pdf, other]

FedLite: A Scalable Approach for Federated Learning on Resource-constrained Clients

Authors: Jianyu Wang, Hang Qi, Ankit Singh Rawat, Sashank Reddi, Sagar Waghmare, Felix X. Yu, Gauri Joshi

Abstract: In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solut… ▽ More In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solution in such a setting, where only a small part of the model is stored and trained on clients while the remaining large part of the model only stays at the servers. However, the model partitioning employed in split learning introduces a significant amount of communication cost. This paper addresses this issue by compressing the additional communication using a novel clustering scheme accompanied by a gradient correction method. Extensive empirical evaluations on image and text benchmarks show that the proposed method can achieve up to $490\times$ communication cost reduction with minimal drop in accuracy, and enables a desirable performance vs. communication trade-off. △ Less

Submitted 16 February, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

arXiv:2110.07751 [pdf, other]

Leveraging Spatial and Temporal Correlations in Sparsified Mean Estimation

Authors: Divyansh Jhunjhunwala, Ankur Mallick, Advait Gadhikar, Swanand Kadhe, Gauri Joshi

Abstract: We study the problem of estimating at a central server the mean of a set of vectors distributed across several nodes (one vector per node). When the vectors are high-dimensional, the communication cost of sending entire vectors may be prohibitive, and it may be imperative for them to use sparsification techniques. While most existing work on sparsified mean estimation is agnostic to the characteri… ▽ More We study the problem of estimating at a central server the mean of a set of vectors distributed across several nodes (one vector per node). When the vectors are high-dimensional, the communication cost of sending entire vectors may be prohibitive, and it may be imperative for them to use sparsification techniques. While most existing work on sparsified mean estimation is agnostic to the characteristics of the data vectors, in many practical applications such as federated learning, there may be spatial correlations (similarities in the vectors sent by different nodes) or temporal correlations (similarities in the data sent by a single node over different iterations of the algorithm) in the data vectors. We leverage these correlations by simply modifying the decoding method used by the server to estimate the mean. We provide an analysis of the resulting estimation error as well as experiments for PCA, K-Means and Logistic Regression, which show that our estimators consistently outperform more sophisticated and expensive sparsification methods. △ Less

Submitted 14 October, 2021; originally announced October 2021.

Comments: Accepted to NeurIPS 2021

arXiv:2109.08119 [pdf, other]

Personalized Federated Learning for Heterogeneous Clients with Clustered Knowledge Transfer

Authors: Yae Jee Cho, Jianyu Wang, Tarun Chiruvolu, Gauri Joshi

Abstract: Personalized federated learning (FL) aims to train model(s) that can perform well for individual clients that are highly data and system heterogeneous. Most work in personalized FL, however, assumes using the same model architecture at all clients and increases the communication cost by sending/receiving models. This may not be feasible for realistic scenarios of FL. In practice, clients have high… ▽ More Personalized federated learning (FL) aims to train model(s) that can perform well for individual clients that are highly data and system heterogeneous. Most work in personalized FL, however, assumes using the same model architecture at all clients and increases the communication cost by sending/receiving models. This may not be feasible for realistic scenarios of FL. In practice, clients have highly heterogeneous system-capabilities and limited communication resources. In our work, we propose a personalized FL framework, PerFed-CKT, where clients can use heterogeneous model architectures and do not directly communicate their model parameters. PerFed-CKT uses clustered co-distillation, where clients use logits to transfer their knowledge to other clients that have similar data-distributions. We theoretically show the convergence and generalization properties of PerFed-CKT and empirically show that PerFed-CKT achieves high test accuracy with several orders of magnitude lower communication cost compared to the state-of-the-art personalized FL schemes. △ Less

Submitted 16 September, 2021; originally announced September 2021.

arXiv:2109.04941 [pdf, other]

doi 10.1109/JSAIT.2021.3082028

Best-Arm Identification in Correlated Multi-Armed Bandits

Authors: Samarth Gupta, Gauri Joshi, Osman Yağan

Abstract: In this paper we consider the problem of best-arm identification in multi-armed bandits in the fixed confidence setting, where the goal is to identify, with probability $1-δ$ for some $δ>0$, the arm with the highest mean reward in minimum possible samples from the set of arms $\mathcal{K}$. Most existing best-arm identification algorithms and analyses operate under the assumption that the rewards… ▽ More In this paper we consider the problem of best-arm identification in multi-armed bandits in the fixed confidence setting, where the goal is to identify, with probability $1-δ$ for some $δ>0$, the arm with the highest mean reward in minimum possible samples from the set of arms $\mathcal{K}$. Most existing best-arm identification algorithms and analyses operate under the assumption that the rewards corresponding to different arms are independent of each other. We propose a novel correlated bandit framework that captures domain knowledge about correlation between arms in the form of upper bounds on expected conditional reward of an arm, given a reward realization from another arm. Our proposed algorithm C-LUCB, which generalizes the LUCB algorithm utilizes this partial knowledge of correlations to sharply reduce the sample complexity of best-arm identification. More interestingly, we show that the total samples obtained by C-LUCB are of the form $\mathcal{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}δ\right)\right)$ as opposed to the typical $\mathcal{O}\left(\sum_{k \in \mathcal{K}} \log\left(\frac{1}δ\right)\right)$ samples required in the independent reward setting. The improvement comes, as the $\mathcal{O}(\log(1/δ))$ term is summed only for the set of competitive arms $\mathcal{C}$, which is a subset of the original set of arms $\mathcal{K}$. The size of the set $\mathcal{C}$, depending on the problem setting, can be as small as $2$, and hence using C-LUCB in the correlated bandits setting can lead to significant performance improvements. Our theoretical findings are supported by experiments on the Movielens and Goodreads recommendation datasets. △ Less

Submitted 10 September, 2021; originally announced September 2021.

Journal ref: IEEE Journal on Selected Areas in Information Theory (Volume: 2, Issue: 2, June 2021)

arXiv:2108.03120 [pdf, other]

Stochastic Deep Model Reference Adaptive Control

Authors: Girish Joshi, Girish Chowdhary

Abstract: In this paper, we present a Stochastic Deep Neural Network-based Model Reference Adaptive Control. Building on our work "Deep Model Reference Adaptive Control", we extend the controller capability by using Bayesian deep neural networks (DNN) to represent uncertainties and model non-linearities. Stochastic Deep Model Reference Adaptive Control uses a Lyapunov-based method to adapt the output-layer… ▽ More In this paper, we present a Stochastic Deep Neural Network-based Model Reference Adaptive Control. Building on our work "Deep Model Reference Adaptive Control", we extend the controller capability by using Bayesian deep neural networks (DNN) to represent uncertainties and model non-linearities. Stochastic Deep Model Reference Adaptive Control uses a Lyapunov-based method to adapt the output-layer weights of the DNN model in real-time, while a data-driven supervised learning algorithm is used to update the inner-layers parameters. This asynchronous network update ensures boundedness and guaranteed tracking performance with a learning-based real-time feedback controller. A Bayesian approach to DNN learning helped avoid over-fitting the data and provide confidence intervals over the predictions. The controller's stochastic nature also ensured "Induced Persistency of excitation," leading to convergence of the overall system signal. △ Less

Submitted 4 August, 2021; originally announced August 2021.

Comments: Accepted in IEEE CDC 2021

arXiv:2107.06917 [pdf, other]

A Field Guide to Federated Optimization

Authors: Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Aguera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz , et al. (28 additional authors not shown)

Abstract: Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and… ▽ More Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications. △ Less

Submitted 14 July, 2021; originally announced July 2021.

arXiv:2106.04707 [pdf, other]

doi 10.1145/3466772.3467047

Job Dispatching Policies for Queueing Systems with Unknown Service Rates

Authors: Tuhinangshu Choudhury, Gauri Joshi, Weina Wang, Sanjay Shakkottai

Abstract: In multi-server queueing systems where there is no central queue holding all incoming jobs, job dispatching policies are used to assign incoming jobs to the queue at one of the servers. Classic job dispatching policies such as join-the-shortest-queue and shortest expected delay assume that the service rates and queue lengths of the servers are known to the dispatcher. In this work, we tackle the p… ▽ More In multi-server queueing systems where there is no central queue holding all incoming jobs, job dispatching policies are used to assign incoming jobs to the queue at one of the servers. Classic job dispatching policies such as join-the-shortest-queue and shortest expected delay assume that the service rates and queue lengths of the servers are known to the dispatcher. In this work, we tackle the problem of job dispatching without the knowledge of service rates and queue lengths, where the dispatcher can only obtain noisy estimates of the service rates by observing job departures. This problem presents a novel exploration-exploitation trade-off between sending jobs to all the servers to estimate their service rates, and exploiting the currently known fastest servers to minimize the expected queueing delay. We propose a bandit-based exploration policy that learns the service rates from observed job departures. Unlike the standard multi-armed bandit problem where only one out of a finite set of actions is optimal, here the optimal policy requires identifying the optimal fraction of incoming jobs to be sent to each server. We present a regret analysis and simulations to demonstrate the effectiveness of the proposed bandit-based exploration policy. △ Less

Submitted 10 June, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

arXiv:2106.02305 [pdf, other]

Local Adaptivity in Federated Learning: Convergence and Consistency

Authors: Jianyu Wang, Zheng Xu, Zachary Garrett, Zachary Charles, Luyang Liu, Gauri Joshi

Abstract: The federated learning (FL) framework trains a machine learning model using decentralized data stored at edge client devices by periodically aggregating locally trained models. Popular optimization algorithms of FL use vanilla (stochastic) gradient descent for both local updates at clients and global updates at the aggregating server. Recently, adaptive optimization methods such as AdaGrad have be… ▽ More The federated learning (FL) framework trains a machine learning model using decentralized data stored at edge client devices by periodically aggregating locally trained models. Popular optimization algorithms of FL use vanilla (stochastic) gradient descent for both local updates at clients and global updates at the aggregating server. Recently, adaptive optimization methods such as AdaGrad have been studied for server updates. However, the effect of using adaptive optimization methods for local updates at clients is not yet understood. We show in both theory and practice that while local adaptive methods can accelerate convergence, they can cause a non-vanishing solution bias, where the final converged solution may be different from the stationary point of the global objective function. We propose correction techniques to overcome this inconsistency and complement the local adaptive methods for FL. Extensive experiments on realistic federated training tasks show that the proposed algorithms can achieve faster convergence and higher test accuracy than the baselines without local adaptivity. △ Less

Submitted 4 June, 2021; originally announced June 2021.

arXiv:2105.07878 [pdf]

doi 10.1109/ACCESS.2021.3070212.

A Review on Explainability in Multimodal Deep Neural Nets

Authors: Gargi Joshi, Rahee Walambe, Ketan Kotecha

Abstract: Artificial Intelligence techniques powered by deep neural nets have achieved much success in several application domains, most significantly and notably in the Computer Vision applications and Natural Language Processing tasks. Surpassing human-level performance propelled the research in the applications where different modalities amongst language, vision, sensory, text play an important role in a… ▽ More Artificial Intelligence techniques powered by deep neural nets have achieved much success in several application domains, most significantly and notably in the Computer Vision applications and Natural Language Processing tasks. Surpassing human-level performance propelled the research in the applications where different modalities amongst language, vision, sensory, text play an important role in accurate predictions and identification. Several multimodal fusion methods employing deep learning models are proposed in the literature. Despite their outstanding performance, the complex, opaque and black-box nature of the deep neural nets limits their social acceptance and usability. This has given rise to the quest for model interpretability and explainability, more so in the complex tasks involving multimodal AI methods. This paper extensively reviews the present literature to present a comprehensive survey and commentary on the explainability in multimodal deep neural nets, especially for the vision and language tasks. Several topics on multimodal AI and its applications for generic domains have been covered in this paper, including the significance, datasets, fundamental building blocks of the methods and techniques, challenges, applications, and future trends in this domain △ Less

Submitted 18 May, 2021; v1 submitted 17 May, 2021; originally announced May 2021.

Comments: 24 pages 6 figures

Journal ref: in IEEE Access, vol. 9, pp. 59800-59821, 2021

arXiv:2105.04699 [pdf, other]

Adaptive Policy Transfer in Reinforcement Learning

Authors: Girish Joshi, Girish Chowdhary

Abstract: Efficient and robust policy transfer remains a key challenge for reinforcement learning to become viable for real-wold robotics. Policy transfer through warm initialization, imitation, or interacting over a large set of agents with randomized instances, have been commonly applied to solve a variety of Reinforcement Learning tasks. However, this seems far from how skill transfer happens in the biol… ▽ More Efficient and robust policy transfer remains a key challenge for reinforcement learning to become viable for real-wold robotics. Policy transfer through warm initialization, imitation, or interacting over a large set of agents with randomized instances, have been commonly applied to solve a variety of Reinforcement Learning tasks. However, this seems far from how skill transfer happens in the biological world: Humans and animals are able to quickly adapt the learned behaviors between similar tasks and learn new skills when presented with new situations. Here we seek to answer the question: Will learning to combine adaptation and exploration lead to a more efficient transfer of policies between domains? We introduce a principled mechanism that can "Adapt-to-Learn", that is adapt the source policy to learn to solve a target task with significant transition differences and uncertainties. We show that the presented method learns to seamlessly combine learning from adaptation and exploration and leads to a robust policy transfer algorithm with significantly reduced sample complexity in transferring skills between related tasks. △ Less

Submitted 10 May, 2021; originally announced May 2021.

arXiv:2102.04487 [pdf, other]

Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning

Authors: Divyansh Jhunjhunwala, Advait Gadhikar, Gauri Joshi, Yonina C. Eldar

Abstract: Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning, especially in bandwidth-limited settings and high-dimensional models. Gradient quantization is an effective way of reducing the number of bits required to communicate each model update, albeit at the cost of having a higher error floor due to the higher variance of th… ▽ More Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning, especially in bandwidth-limited settings and high-dimensional models. Gradient quantization is an effective way of reducing the number of bits required to communicate each model update, albeit at the cost of having a higher error floor due to the higher variance of the stochastic gradients. In this work, we propose an adaptive quantization strategy called AdaQuantFL that aims to achieve communication efficiency as well as a low error floor by changing the number of quantization levels during the course of training. Experiments on training deep neural networks show that our method can converge in much fewer communicated bits as compared to fixed quantization level setups, with little or no impact on training and test accuracy. △ Less

Submitted 8 February, 2021; originally announced February 2021.

Comments: Accepted to ICASSP 2021

arXiv:2012.13608 [pdf, other]

Synergy via Redundancy: Adaptive Replication Strategies and Fundamental Limits

Authors: Gauri Joshi, Dhruva Kaushal

Abstract: The maximum possible throughput (or the rate of job completion) of a multi-server system is typically the sum of the service rates of individual servers. Recent work shows that launching multiple replicas of a job and canceling them as soon as one copy finishes can boost the throughput, especially when the service time distribution has high variability. This means that redundancy can, in fact, cre… ▽ More The maximum possible throughput (or the rate of job completion) of a multi-server system is typically the sum of the service rates of individual servers. Recent work shows that launching multiple replicas of a job and canceling them as soon as one copy finishes can boost the throughput, especially when the service time distribution has high variability. This means that redundancy can, in fact, create synergy among servers such that their overall throughput is greater than the sum of individual servers. This work seeks to find the fundamental limit of the throughput boost achieved by job replication and the optimal replication policy to achieve it. While most previous works consider upfront replication policies, we expand the set of possible policies to delayed launch of replicas. The search for the optimal adaptive replication policy can be formulated as a Markov Decision Process, using which we propose two myopic replication policies, MaxRate and AdaRep, to adaptively replicate jobs. In order to quantify the optimality gap of these and other policies, we derive upper bounds on the service capacity, which provide fundamental limits on the throughput of queueing systems with redundancy. △ Less

Submitted 25 December, 2020; originally announced December 2020.

Comments: Accepted for publication in the IEEE/ACM Transactions on Networking

arXiv:2012.08009 [pdf, other]

Bandit-based Communication-Efficient Client Selection Strategies for Federated Learning

Authors: Yae Jee Cho, Samarth Gupta, Gauri Joshi, Osman Yağan

Abstract: Due to communication constraints and intermittent client availability in federated learning, only a subset of clients can participate in each training round. While most prior works assume uniform and unbiased client selection, recent work on biased client selection has shown that selecting clients with higher local losses can improve error convergence speed. However, previously proposed biased sel… ▽ More Due to communication constraints and intermittent client availability in federated learning, only a subset of clients can participate in each training round. While most prior works assume uniform and unbiased client selection, recent work on biased client selection has shown that selecting clients with higher local losses can improve error convergence speed. However, previously proposed biased selection strategies either require additional communication cost for evaluating the exact local loss or utilize stale local loss, which can even make the model diverge. In this paper, we present a bandit-based communication-efficient client selection strategy UCB-CS that achieves faster convergence with lower communication overhead. We also demonstrate how client selection can be used to improve fairness. △ Less

Submitted 14 December, 2020; originally announced December 2020.

arXiv:2011.02920 [pdf, other]

Asynchronous Deep Model Reference Adaptive Control

Authors: Girish Joshi, Jasvir Virdi, Girish Chowdhary

Abstract: In this paper, we present Asynchronous implementation of Deep Neural Network-based Model Reference Adaptive Control (DMRAC). We evaluate this new neuro-adaptive control architecture through flight tests on a small quadcopter. We demonstrate that a single DMRAC controller can handle significant nonlinearities due to severe system faults and deliberate wind disturbances while executing high-bandwidt… ▽ More In this paper, we present Asynchronous implementation of Deep Neural Network-based Model Reference Adaptive Control (DMRAC). We evaluate this new neuro-adaptive control architecture through flight tests on a small quadcopter. We demonstrate that a single DMRAC controller can handle significant nonlinearities due to severe system faults and deliberate wind disturbances while executing high-bandwidth attitude control. We also show that the architecture has long-term learning abilities across different flight regimes, and can generalize to fly different flight trajectories than those on which it was trained. These results demonstrating the efficacy of this architecture for high bandwidth closed-loop attitude control of unstable and nonlinear robots operating in adverse situations. To achieve these results, we designed a software+communication architecture to ensure online real-time inference of the deep network on a high-bandwidth computation-limited platform. We expect that this architecture will benefit other deep learning in the closed-loop experiments on robots. △ Less

Submitted 4 November, 2020; originally announced November 2020.

Comments: Accepted in CORL2020. arXiv admin note: text overlap with arXiv:1909.08602

arXiv:2010.01243 [pdf, other]

Client Selection in Federated Learning: Convergence Analysis and Power-of-Choice Selection Strategies

Authors: Yae Jee Cho, Jianyu Wang, Gauri Joshi

Abstract: Federated learning is a distributed optimization paradigm that enables a large number of resource-limited client nodes to cooperatively train a model without data sharing. Several works have analyzed the convergence of federated learning by accounting of data heterogeneity, communication and computation limitations, and partial client participation. However, they assume unbiased client participati… ▽ More Federated learning is a distributed optimization paradigm that enables a large number of resource-limited client nodes to cooperatively train a model without data sharing. Several works have analyzed the convergence of federated learning by accounting of data heterogeneity, communication and computation limitations, and partial client participation. However, they assume unbiased client participation, where clients are selected at random or in proportion of their data sizes. In this paper, we present the first convergence analysis of federated optimization for biased client selection strategies, and quantify how the selection bias affects convergence speed. We reveal that biasing client selection towards clients with higher local loss achieves faster error convergence. Using this insight, we propose Power-of-Choice, a communication- and computation-efficient client selection framework that can flexibly span the trade-off between convergence speed and solution bias. Our experiments demonstrate that Power-of-Choice strategies converge up to 3 $\times$ faster and give $10$% higher test accuracy than the baseline random selection. △ Less

Submitted 2 October, 2020; originally announced October 2020.

arXiv:2009.03988 [pdf, other]

A new architecture for hand-worn Sign language to Speech translator

Authors: Sai Charan Bodda, Palki Gupta, Gaurav Joshi, Ayush Chaturvedi

Abstract: People with speech and hearing impairments often rely on sign language to communicate with others but most of the general population cannot understand sign language and sign language itself is a difficult language to learn, so there is a definite need for technologies to translate sign language to speech. In this paper, we describe the design and implementation of Smart glove, a hand-worn hardware… ▽ More People with speech and hearing impairments often rely on sign language to communicate with others but most of the general population cannot understand sign language and sign language itself is a difficult language to learn, so there is a definite need for technologies to translate sign language to speech. In this paper, we describe the design and implementation of Smart glove, a hand-worn hardware device capable of translating American Sign Language gestures into English speech by tracking the finger's orientation, gestures and hand motion. It uses hardware sensors like Flex, Accelerometer and gyroscope and intelligent software to capture and translate the gestures into speech. This paper explains the translation of both Alphabet and Word gestures. New approaches and algorithms are proposed and implemented to address hardware-dependent issues in existing glove based designs. The whole device is designed to be modular with distributed processing units to encourage modular enhancement, reducing complexity, and interrelation between subsystems.Decision Trees are used in gesture recognition and error correction. We hope that the henceforth mentioned design and architecture would be the basis for the advancement in research related to sensor-based sign language translation along with research for smart glove and cybernetic accessories. △ Less

Submitted 8 September, 2020; originally announced September 2020.

Comments: 9 pages, 5 figures, research done as part of Samsung electronics

arXiv:2009.01598 [pdf, other]

Service Rate Region: A New Aspect of Coded Distributed System Design

Authors: Mehmet Aktas, Gauri Joshi, Swanand Kadhe, Fatemeh Kazemi, Emina Soljanin

Abstract: Erasure coding has been recognized as a powerful method to mitigate delays due to slow or straggling nodes in distributed systems. This work shows that erasure coding of data objects can flexibly handle skews in the request rates. Coding can help boost the \emph{service rate region}, that is, increase the overall volume of data access requests that the system can handle. This paper aims to postula… ▽ More Erasure coding has been recognized as a powerful method to mitigate delays due to slow or straggling nodes in distributed systems. This work shows that erasure coding of data objects can flexibly handle skews in the request rates. Coding can help boost the \emph{service rate region}, that is, increase the overall volume of data access requests that the system can handle. This paper aims to postulate the service rate region as an important consideration in the design of erasure-coded distributed systems. We highlight several open problems that can be grouped into two broad threads: 1) characterizing the service rate region of a given code and finding the optimal request allocation, and2) designing the underlying erasure code for a given service rate region. As contributions along the first thread, we characterize the rate regions of maximum-distance-separable, locally repairable, and Simplex codes. We show the effectiveness of hybrid codes that combine replication and erasure coding in terms of code design. We also discover fundamental connections between multi-set batch codes and the problem of maximizing the service rate region. △ Less

Submitted 27 June, 2021; v1 submitted 3 September, 2020; originally announced September 2020.

arXiv:2007.10800 [pdf, other]

Probabilistic Neighbourhood Component Analysis: Sample Efficient Uncertainty Estimation in Deep Learning

Authors: Ankur Mallick, Chaitanya Dwivedi, Bhavya Kailkhura, Gauri Joshi, T. Yong-Jin Han

Abstract: While Deep Neural Networks (DNNs) achieve state-of-the-art accuracy in various applications, they often fall short in accurately estimating their predictive uncertainty and, in turn, fail to recognize when these predictions may be wrong. Several uncertainty-aware models, such as Bayesian Neural Network (BNNs) and Deep Ensembles have been proposed in the literature for quantifying predictive uncert… ▽ More While Deep Neural Networks (DNNs) achieve state-of-the-art accuracy in various applications, they often fall short in accurately estimating their predictive uncertainty and, in turn, fail to recognize when these predictions may be wrong. Several uncertainty-aware models, such as Bayesian Neural Network (BNNs) and Deep Ensembles have been proposed in the literature for quantifying predictive uncertainty. However, research in this area has been largely confined to the big data regime. In this work, we show that the uncertainty estimation capability of state-of-the-art BNNs and Deep Ensemble models degrades significantly when the amount of training data is small. To address the issue of accurate uncertainty estimation in the small-data regime, we propose a probabilistic generalization of the popular sample-efficient non-parametric kNN approach. Our approach enables deep kNN classifier to accurately quantify underlying uncertainties in its prediction. We demonstrate the usefulness of the proposed approach by achieving superior uncertainty quantification as compared to state-of-the-art on a real-world application of COVID-19 diagnosis from chest X-Rays. Our code is available at https://github.com/ankurmallick/sample-efficient-uq △ Less

Submitted 18 July, 2020; originally announced July 2020.

arXiv:2007.07481 [pdf, other]

Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization

Authors: Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, H. Vincent Poor

Abstract: In federated optimization, heterogeneity in the clients' local datasets and computation speeds results in large variations in the number of local updates performed by each client in each communication round. Naive weighted aggregation of such models causes objective inconsistency, that is, the global model converges to a stationary point of a mismatched objective function which can be arbitrarily… ▽ More In federated optimization, heterogeneity in the clients' local datasets and computation speeds results in large variations in the number of local updates performed by each client in each communication round. Naive weighted aggregation of such models causes objective inconsistency, that is, the global model converges to a stationary point of a mismatched objective function which can be arbitrarily different from the true objective. This paper provides a general framework to analyze the convergence of federated heterogeneous optimization algorithms. It subsumes previously proposed methods such as FedAvg and FedProx and provides the first principled understanding of the solution bias and the convergence slowdown due to objective inconsistency. Using insights from this analysis, we propose FedNova, a normalized averaging method that eliminates objective inconsistency while preserving fast error convergence. △ Less

Submitted 15 July, 2020; originally announced July 2020.

arXiv:2003.10579 [pdf, other]

Slow and Stale Gradients Can Win the Race

Authors: Sanghamitra Dutta, Jianyu Wang, Gauri Joshi

Abstract: Distributed Stochastic Gradient Descent (SGD) when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect the convergence error. In this work, we present a novel theoretical characterization of the speedup offered by asynchronous methods by an… ▽ More Distributed Stochastic Gradient Descent (SGD) when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect the convergence error. In this work, we present a novel theoretical characterization of the speedup offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime(wallclock time). The main novelty in our work is that our runtime analysis considers random straggling delays, which helps us design and compare distributed SGD algorithms that strike a balance between straggling and staleness. We also provide a new error convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions. Finally, based on our theoretical characterization of the error-runtime trade-off, we propose a method of gradually varying synchronicity in distributed SGD and demonstrate its performance on CIFAR10 dataset. △ Less

Submitted 23 March, 2020; originally announced March 2020.

Comments: Some of the results have appeared in AISTATS 2018. This is an extended version with additional results, in particular, an adaptive synchronicity strategy called AdaSync. arXiv admin note: substantial text overlap with arXiv:1803.01113

arXiv:2003.05649 [pdf, other]

Machine Learning on Volatile Instances

Authors: Xiaoxi Zhang, Jianyu Wang, Gauri Joshi, Carlee Joe-Wong

Abstract: Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple worker nodes. However, running distributed SGD can be prohibitively expensive because it may require specialized computing resources such as GPUs for extended per… ▽ More Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple worker nodes. However, running distributed SGD can be prohibitively expensive because it may require specialized computing resources such as GPUs for extended periods of time. We propose cost-effective strategies to exploit volatile cloud instances that are cheaper than standard instances, but may be interrupted by higher priority workloads. To the best of our knowledge, this work is the first to quantify how variations in the number of active worker nodes (as a result of preemption) affects SGD convergence and the time to train the model. By understanding these trade-offs between preemption probability of the instances, accuracy, and training time, we are able to derive practical strategies for configuring distributed SGD jobs on volatile instances such as Amazon EC2 spot instances and other preemptible cloud instances. Experimental results show that our strategies achieve good training performance at substantially lower cost. △ Less

Submitted 12 March, 2020; originally announced March 2020.

arXiv:2002.09539 [pdf, other]

Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD

Authors: Jianyu Wang, Hao Liang, Gauri Joshi

Abstract: Distributed stochastic gradient descent (SGD) is essential for scaling the machine learning algorithms to a large number of computing nodes. However, the infrastructures variability such as high communication delay or random node slowdown greatly impedes the performance of distributed SGD algorithm, especially in a wireless system or sensor networks. In this paper, we propose an algorithmic approa… ▽ More Distributed stochastic gradient descent (SGD) is essential for scaling the machine learning algorithms to a large number of computing nodes. However, the infrastructures variability such as high communication delay or random node slowdown greatly impedes the performance of distributed SGD algorithm, especially in a wireless system or sensor networks. In this paper, we propose an algorithmic approach named Overlap-Local-SGD (and its momentum variant) to overlap the communication and computation so as to speedup the distributed training procedure. The approach can help to mitigate the straggler effects as well. We achieve this by adding an anchor model on each node. After multiple local updates, locally trained models will be pulled back towards the synchronized anchor model rather than communicating with others. Experimental results of training a deep neural network on CIFAR-10 dataset demonstrate the effectiveness of Overlap-Local-SGD. We also provide a convergence guarantee for the proposed algorithm under non-convex objective functions. △ Less

Submitted 21 February, 2020; originally announced February 2020.

Comments: Accepted to ICASSP 2020

arXiv:1912.04977 [pdf, other]

Advances and Open Problems in Federated Learning

Authors: Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson , et al. (34 additional authors not shown)

Abstract: Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs re… ▽ More Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges. △ Less

Submitted 8 March, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

Comments: Published in Foundations and Trends in Machine Learning Vol 4 Issue 1. See: https://www.nowpublishers.com/article/Details/MAL-083

arXiv:1911.03959 [pdf, other]

doi 10.1109/TIT.2021.3081508

Multi-Armed Bandits with Correlated Arms

Authors: Samarth Gupta, Shreyas Chaudhari, Gauri Joshi, Osman Yağan

Abstract: We consider a multi-armed bandit framework where the rewards obtained by pulling different arms are correlated. We develop a unified approach to leverage these reward correlations and present fundamental generalizations of classic bandit algorithms to the correlated setting. We present a unified proof technique to analyze the proposed algorithms. Rigorous analysis of C-UCB (the correlated bandit v… ▽ More We consider a multi-armed bandit framework where the rewards obtained by pulling different arms are correlated. We develop a unified approach to leverage these reward correlations and present fundamental generalizations of classic bandit algorithms to the correlated setting. We present a unified proof technique to analyze the proposed algorithms. Rigorous analysis of C-UCB (the correlated bandit version of Upper-confidence-bound) reveals that the algorithm ends up pulling certain sub-optimal arms, termed as non-competitive, only O(1) times, as opposed to the O(log T) pulls required by classic bandit algorithms such as UCB, TS etc. We present regret-lower bound and show that when arms are correlated through a latent random source, our algorithms obtain order-optimal regret. We validate the proposed algorithms via experiments on the MovieLens and Goodreads datasets, and show significant improvement over classical bandit algorithms. △ Less

Submitted 10 September, 2021; v1 submitted 6 November, 2019; originally announced November 2019.

Comments: A special case of the model studied in this paper is presented in arXiv:1808.05904

Journal ref: IEEE Transactions on Information Theory, 2021

arXiv:1910.05858 [pdf, other]

Deep Kernels with Probabilistic Embeddings for Small-Data Learning

Authors: Ankur Mallick, Chaitanya Dwivedi, Bhavya Kailkhura, Gauri Joshi, T. Yong-Jin Han

Abstract: Gaussian Processes (GPs) are known to provide accurate predictions and uncertainty estimates even with small amounts of labeled data by capturing similarity between data points through their kernel function. However traditional GP kernels are not very effective at capturing similarity between high dimensional data points. Neural networks can be used to learn good representations that encode intric… ▽ More Gaussian Processes (GPs) are known to provide accurate predictions and uncertainty estimates even with small amounts of labeled data by capturing similarity between data points through their kernel function. However traditional GP kernels are not very effective at capturing similarity between high dimensional data points. Neural networks can be used to learn good representations that encode intricate structures in high dimensional data, and can be used as inputs to the GP kernel. However the huge data requirement of neural networks makes this approach ineffective in small data settings. To solves the conflicting problems of representation learning and data efficiency, we propose to learn deep kernels on probabilistic embeddings by using a probabilistic neural network. Our approach maps high-dimensional data to a probability distribution in a low dimensional subspace and then computes a kernel between these distributions to capture similarity. To enable end-to-end learning, we derive a functional gradient descent procedure for training the model. Experiments on a variety of datasets show that our approach outperforms the state-of-the-art in GP kernel learning in both supervised and semi-supervised settings. We also extend our approach to other small-data paradigms such as few-shot classification where it outperforms previous approaches on mini-Imagenet and CUB datasets. △ Less

Submitted 13 November, 2021; v1 submitted 13 October, 2019; originally announced October 2019.

arXiv:1910.00762 [pdf, other]

Accelerating Deep Learning by Focusing on the Biggest Losers

Authors: Angela H. Jiang, Daniel L. -K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminksy, Michael Kozuch, Zachary C. Lipton, Padmanabhan Pillai

Abstract: This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration. Selective-Backprop uses the output of a training example's forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of co… ▽ More This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration. Selective-Backprop uses the output of a training example's forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of computationally-expensive backpropagation steps performed, Selective-Backprop accelerates training. Evaluation on CIFAR10, CIFAR100, and SVHN, across a variety of modern image models, shows that Selective-Backprop converges to target error rates up to 3.5x faster than with standard SGD and between 1.02--1.8x faster than a state-of-the-art importance sampling approach. Further acceleration of 26% can be achieved by using stale forward pass results for selection, thus also skipping forward passes of low priority examples. △ Less

Submitted 1 October, 2019; originally announced October 2019.

arXiv:1909.08602 [pdf, other]

Deep Model Reference Adaptive Control

Authors: Girish Joshi, Girish Chowdhary

Abstract: We present a new neuroadaptive architecture: Deep Neural Network based Model Reference Adaptive Control (DMRAC). Our architecture utilizes the power of deep neural network representations for modeling significant nonlinearities while marrying it with the boundedness guarantees that characterize MRAC based controllers. We demonstrate through simulations and analysis that DMRAC can subsume previousl… ▽ More We present a new neuroadaptive architecture: Deep Neural Network based Model Reference Adaptive Control (DMRAC). Our architecture utilizes the power of deep neural network representations for modeling significant nonlinearities while marrying it with the boundedness guarantees that characterize MRAC based controllers. We demonstrate through simulations and analysis that DMRAC can subsume previously studied learning based MRAC methods, such as concurrent learning and GP-MRAC. This makes DMRAC a highly powerful architecture for high-performance control of nonlinear systems with long-term learning properties. △ Less

Submitted 18 September, 2019; originally announced September 2019.

Comments: Accepted in IEEE CDC-2019

Showing 1–50 of 74 results for author: Joshi, G