Predictive Inference in Multi-environment Scenarios

Authors: John C. Duchi, Suyash Gupta, Kuanhao Jiang, Pragya Sur

Abstract: We address the challenge of constructing valid confidence intervals and sets in problems of prediction across multiple environments. We investigate two types of coverage suitable for these problems, extending the jackknife and split-conformal methods to show how to obtain distribution-free coverage in such non-traditional, hierarchical data-generating scenarios. Our contributions also include extensions for settings with non-real-valued responses and a theory of consistency for predictive inference in these general problems. We demonstrate a novel resizing method to adapt to problem difficulty, which applies both to existing approaches for predictive inference with hierarchical data and the methods we develop; this reduces prediction set sizes using limited information from the test environment, a key to the methods' practical performance, which we evaluate through neurochemical sensing and species classification datasets.

Submitted 24 March, 2024; originally announced March 2024.
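
For orientation, the standard single-environment split-conformal construction that the abstract above extends can be sketched as follows. This is a minimal illustration, not the paper's multi-environment method: the function names, the absolute-residual score, and the `predict` callable are all assumptions made for the example.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: the interval is unbounded.
        return math.inf
    return sorted(scores)[k - 1]


def split_conformal_interval(predict, x_cal, y_cal, x_new, alpha=0.1):
    """Prediction interval for x_new from a held-out calibration set.

    `predict` is any fitted point predictor (hypothetical here); the score
    is the absolute residual on the calibration data.
    """
    scores = [abs(y - predict(x)) for x, y in zip(x_cal, y_cal)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q
```

Under exchangeability of calibration and test points, an interval built this way covers the new response with probability at least 1 - alpha; the hierarchical setting in the abstract needs different notions of coverage, which this sketch does not address.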

arXiv:2402.08794 [pdf, ps, other]

An information-theoretic lower bound in time-uniform estimation

Authors: John C. Duchi, Saminul Haque

Abstract: We present an information-theoretic lower bound for the problem of parameter estimation with time-uniform coverage guarantees. Via a new reduction to sequential testing, we obtain stronger lower bounds that capture the hardness of the time-uniform setting. In the case of location model estimation, logistic regression, and exponential family models, our $Ω(\sqrt{n^{-1}\log \log n})$ lower bound is sharp to within constant factors in typical settings.

Submitted 11 June, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: 16 pages

arXiv:2402.07131 [pdf, other]

Resampling methods for private statistical inference

Authors: Karan Chadha, John Duchi, Rohith Kuditipudi

Abstract: We consider the task of constructing confidence intervals with differential privacy. We propose two private variants of the non-parametric bootstrap, which privately compute the median of the results of multiple "little" bootstraps run on partitions of the data and give asymptotic bounds on the coverage error of the resulting confidence intervals. For a fixed differential privacy parameter $ε$, our methods enjoy the same error rates as the non-private bootstrap to within logarithmic factors in the sample size $n$. We empirically validate the performance of our methods for mean estimation, median estimation, and logistic regression with both real and synthetic data. Our methods achieve similar coverage accuracy to existing methods (and non-private baselines) while providing notably shorter ($\gtrsim 10$ times) confidence intervals than previous approaches.

Submitted 3 June, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

Comments: 45 pages

arXiv:2311.01453 [pdf, other]

PPI++: Efficient Prediction-Powered Inference

Authors: Anastasios N. Angelopoulos, John C. Duchi, Tijana Zrnic

Abstract: We present PPI++: a computationally lightweight methodology for estimation and inference based on a small labeled dataset and a typically much larger dataset of machine-learning predictions. The methods automatically adapt to the quality of available predictions, yielding easy-to-compute confidence sets -- for parameters of any dimensionality -- that always improve on classical intervals using only the labeled data. PPI++ builds on prediction-powered inference (PPI), which targets the same problem setting, improving its computational and statistical efficiency. Real and synthetic experiments demonstrate the benefits of the proposed adaptations.

Submitted 25 March, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

Comments: Code available at https://github.com/aangelopoulos/ppi_py
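
To make the setting concrete, here is a rough sketch of a prediction-powered point estimate of a mean, assuming the estimator takes the weighted form shown in the code. It does not use the `ppi_py` package linked above; the function name, argument names, and the fixed weight `lam` are illustrative assumptions, and the data-driven choice of the weight and the confidence intervals are not reproduced.

```python
from statistics import fmean


def ppi_mean(y_labeled, f_labeled, f_unlabeled, lam=1.0):
    """Weighted prediction-powered estimate of E[Y] (illustrative sketch).

    y_labeled   : labels on the small labeled set
    f_labeled   : model predictions on the same labeled points
    f_unlabeled : model predictions on the large unlabeled set
    lam         : weight on the predictions (assumed fixed here);
                  lam = 0 gives the labeled-only mean, and lam = 1 adds
                  the full prediction-based correction.
    """
    if len(y_labeled) != len(f_labeled):
        raise ValueError("labels and labeled predictions must align")
    # Labeled-only mean, corrected by how much the predictions on the
    # unlabeled pool differ on average from those on the labeled set.
    return fmean(y_labeled) + lam * (fmean(f_unlabeled) - fmean(f_labeled))
```

With `lam = 0` the predictions are ignored entirely, which is how an estimator of this form can fall back to the classical one when the predictions are uninformative.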

arXiv:2307.11947 [pdf, other]

Collaboratively Learning Linear Models with Structured Missing Data

Authors: Chen Cheng, Gary Cheng, John Duchi

Abstract: We study the problem of collaboratively learning least squares estimates for $m$ agents. Each agent observes a different subset of the features (e.g., containing data collected from sensors of varying resolution). Our goal is to determine how to coordinate the agents in order to produce the best estimator for each agent. We propose a distributed, semi-supervised algorithm Collab, consisting of three steps: local training, aggregation, and distribution. Our procedure does not require communicating the labeled data, making it communication efficient and useful in settings where the labeled data is inaccessible. Despite this handicap, our procedure is nearly asymptotically local minimax optimal, even among estimators allowed to communicate the labeled data such as imputation methods. We test our method on real and synthetic data.

Submitted 21 July, 2023; originally announced July 2023.

arXiv:2307.11749 [pdf, other]

Differentially Private Heavy Hitter Detection using Federated Analytics

Authors: Karan Chadha, Junye Chen, John Duchi, Vitaly Feldman, Hanieh Hashemi, Omid Javidbakht, Audra McMillan, Kunal Talwar

Abstract: In this work, we study practical heuristics to improve the performance of prefix-tree based algorithms for differentially private heavy hitter detection. Our model assumes each user has multiple data points and the goal is to learn as many of the most frequent data points as possible across all users' data with aggregate and local differential privacy. We propose an adaptive hyperparameter tuning algorithm that improves the performance of the algorithm while satisfying computational, communication and privacy constraints. We explore the impact of different data-selection schemes as well as the impact of introducing deny lists during multiple runs of the algorithm. We test these improvements using extensive experimentation on the Reddit dataset (Caldas et al., 2018) on the task of learning the most frequent words.

Submitted 21 July, 2023; originally announced July 2023.

arXiv:2301.07078 [pdf, ps, other]

A Fast Algorithm for Adaptive Private Mean Estimation

Authors: John Duchi, Saminul Haque, Rohith Kuditipudi

Abstract: We design an $(\varepsilon, δ)$-differentially private algorithm to estimate the mean of a $d$-variate distribution, with unknown covariance $Σ$, that is adaptive to $Σ$. To within polylogarithmic factors, the estimator achieves optimal rates of convergence with respect to the induced Mahalanobis norm $||\cdot||_Σ$, takes time $\tilde{O}(n d^2)$ to compute, has near linear sample complexity for sub-Gaussian distributions, allows $Σ$ to be degenerate or low rank, and adaptively extends beyond sub-Gaussianity. Prior to this work, other methods required exponential computation time or the superlinear scaling $n = Ω(d^{3/2})$ to achieve non-trivial error with respect to the norm $||\cdot||_Σ$.

Submitted 17 January, 2023; originally announced January 2023.

Comments: 38 pages, no figures

arXiv:2211.10082 [pdf, other]

Private Federated Statistics in an Interactive Setting

Authors: Audra McMillan, Omid Javidbakht, Kunal Talwar, Elliot Briggs, Mike Chatzidakis, Junye Chen, John Duchi, Vitaly Feldman, Yusuf Goren, Michael Hesse, Vojta Jina, Anil Katti, Albert Liu, Cheney Lyford, Joey Meyer, Alex Palmer, David Park, Wonhee Park, Gianni Parsa, Paul Pelzl, Rehan Rishi, Congzheng Song, Shan Wang, Shundong Zhou

Abstract: Privately learning statistics of events on devices can enable improved user experience. Differentially private algorithms for such problems can benefit significantly from interactivity. We argue that an aggregation protocol can enable an interactive private federated statistics system where users' devices maintain control of the privacy assurance. We describe the architecture of such a system, and analyze its security properties.

Submitted 18 November, 2022; originally announced November 2022.

arXiv:2210.17070 [pdf, ps, other]

Private optimization in the interpolation regime: faster rates and hardness results

Authors: Hilal Asi, Karan Chadha, Gary Cheng, John Duchi

Abstract: In non-private stochastic convex optimization, stochastic gradient methods converge much faster on interpolation problems -- problems where there exists a solution that simultaneously minimizes all of the sample losses -- than on non-interpolating ones; we show that generally similar improvements are impossible in the private setting. However, when the functions exhibit quadratic growth around the optimum, we show (near) exponential improvements in the private sample complexity. In particular, we propose an adaptive algorithm that improves the sample complexity to achieve expected error $α$ from $\frac{d}{\varepsilon \sqrt{α}}$ to $\frac{1}{α^ρ} + \frac{d}{\varepsilon} \log\left(\frac{1}{α}\right)$ for any fixed $ρ > 0$, while retaining the standard minimax-optimal sample complexity for non-interpolation problems. We prove a lower bound that shows the dimension-dependent term is tight. Furthermore, we provide a superefficiency result which demonstrates the necessity of the polynomial term for adaptive algorithms: any algorithm that has a polylogarithmic sample complexity for interpolation problems cannot achieve the minimax-optimal rates for the family of non-interpolation problems.

Submitted 31 October, 2022; originally announced October 2022.

Comments: published at ICML 2022; 25 pages

arXiv:2210.13497 [pdf, other]

Subspace Recovery from Heterogeneous Data with Non-isotropic Noise

Authors: John Duchi, Vitaly Feldman, Lunjia Hu, Kunal Talwar

Abstract: Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: the principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users with user $i$ contributing data samples from a $d$-dimensional distribution with mean $μ_i$. Our goal is to recover the linear subspace shared by $μ_1,\ldots,μ_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $μ_i$. If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently-computable estimator under non-spherical and user-dependent noise. We prove an upper bound for the estimation error of our estimator in general scenarios where the number of data points and amount of noise can vary across users, and prove an information-theoretic error lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.

Submitted 24 October, 2022; originally announced October 2022.

Comments: In NeurIPS 2022

arXiv:2206.12041 [pdf, other]

How many labelers do you have? A closer look at gold-standard labels

Authors: Chen Cheng, Hilal Asi, John Duchi

Abstract: The construction of most supervised learning datasets revolves around collecting multiple labels for each instance, then aggregating the labels to form a type of "gold-standard". We question the wisdom of this pipeline by developing a (stylized) theoretical model of this process and analyzing its statistical consequences, showing how access to non-aggregated label information can make training well-calibrated models more feasible than it is with gold-standard labels. The entire story, however, is subtle, and the contrasts between aggregated and fuller label information depend on the particulars of the problem, where estimators that use aggregated information exhibit robust but slower rates of convergence, while estimators that can effectively leverage all labels converge more quickly if they have fidelity to (or can learn) the true labeling process. The theory makes several predictions for real-world datasets, including when non-aggregate labels should improve learning performance, which we test to corroborate the validity of our predictions.

Submitted 4 June, 2024; v1 submitted 23 June, 2022; originally announced June 2022.

Comments: 63 pages, 8 figures

arXiv:2206.07236 [pdf, other]

Query-Adaptive Predictive Inference with Partial Labels

Authors: Maxime Cauchois, John Duchi

Abstract: The cost and scarcity of fully supervised labels in statistical machine learning encourage using partially labeled data for model validation as a cheaper and more accessible alternative. Effectively collecting and leveraging weakly supervised data for large-space structured prediction tasks thus becomes an important part of an end-to-end learning system. We propose a new computationally-friendly methodology to construct predictive sets using only partially labeled data on top of black-box predictive models. To do so, we introduce "probe" functions as a way to describe weakly supervised instances and define a false discovery proportion-type loss, both of which seamlessly adapt to partial supervision and structured prediction -- ranking, matching, segmentation, multilabel or multiclass classification. Our experiments highlight the validity of our predictive set construction as well as the attractiveness of a more flexible user-dependent loss framework.

Submitted 14 June, 2022; originally announced June 2022.

arXiv:2202.09889 [pdf, ps, other]

Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression

Authors: Chen Cheng, John Duchi, Rohith Kuditipudi

Abstract: We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X θ + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to γ \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $σ^2 \to 0$, any estimator that incurs at least $\mathsf{c}σ^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.

Submitted 16 June, 2022; v1 submitted 20 February, 2022; originally announced February 2022.

Comments: 32 pages; accepted to the 35th Annual Conference on Learning Theory (COLT) 2022

arXiv:2201.08315 [pdf, other]

Predictive Inference with Weak Supervision

Authors: Maxime Cauchois, Suyash Gupta, Alnur Ali, John Duchi

Abstract: The expense of acquiring labels in large-scale statistical machine learning makes partially and weakly-labeled data attractive, though it is not always apparent how to leverage such data for model fitting or validation. We present a methodology to bridge the gap between partial supervision and validation, developing a conformal prediction framework to provide valid predictive confidence sets -- sets that cover a true label with a prescribed probability, independent of the underlying distribution -- using weakly labeled data. To do so, we introduce a (necessary) new notion of coverage and predictive validity, then develop several application scenarios, providing efficient algorithms for classification and several large-scale structured prediction problems. We corroborate the hypothesis that the new coverage definition allows for tighter and more informative (but valid) confidence sets through several experiments.

Submitted 9 February, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

arXiv:2108.07313 [pdf, other]

Federated Asymptotics: a model to compare federated learning algorithms

Authors: Gary Cheng, Karan Chadha, John Duchi

Abstract: We propose an asymptotic framework to analyze the performance of (personalized) federated learning algorithms. In this new framework, we formulate federated learning as a multi-criterion objective, where the goal is to minimize each client's loss using information from all of the clients. We analyze a linear regression model where, for a given client, we may theoretically compare the performance of various algorithms in the high-dimensional asymptotic limit. This asymptotic multi-criterion approach naturally models the high-dimensional, many-device nature of federated learning. These tools make fairly precise predictions about the benefits of personalization and information sharing in federated scenarios -- at least in our (stylized) model -- including that Federated Averaging with simple client fine-tuning achieves the same asymptotic risk as the more intricate meta-learning and proximal-regularized approaches and outperforms Federated Averaging without personalization. We evaluate these predictions on federated versions of the EMNIST, CIFAR-100, Shakespeare, and Stack Overflow datasets, where the experiments corroborate the theoretical predictions, suggesting such frameworks may provide a useful guide to practical algorithmic development.

Submitted 18 February, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

Comments: 42 pages (11 main pages, 2 reference pages, 29 appendix pages), 13 figures

arXiv:2108.02391 [pdf, other]

Adapting to Function Difficulty and Growth Conditions in Private Optimization

Authors: Hilal Asi, Daniel Levy, John Duchi

Abstract: We develop algorithms for private stochastic convex optimization that adapt to the hardness of the specific function we wish to optimize. While previous work provides worst-case bounds for arbitrary convex functions, it is often the case that the function at hand belongs to a smaller class that enjoys faster rates. Concretely, we show that for functions exhibiting $κ$-growth around the optimum, i.e., $f(x) \ge f(x^*) + λκ^{-1} \|x-x^*\|_2^κ$ for $κ > 1$, our algorithms improve upon the standard ${\sqrt{d}}/{n\varepsilon}$ privacy rate to the faster $({\sqrt{d}}/{n\varepsilon})^{\tfrac{κ}{κ - 1}}$. Crucially, they achieve these rates without knowledge of the growth constant $κ$ of the function. Our algorithms build upon the inverse sensitivity mechanism, which adapts to instance difficulty (Asi & Duchi, 2020), and recent localization techniques in private optimization (Feldman et al., 2020). We complement our algorithms with matching lower bounds for these function classes and demonstrate that our adaptive algorithm is simultaneously (minimax) optimal over all $κ \ge 1+c$ whenever $c = Θ(1)$.

Submitted 5 August, 2021; originally announced August 2021.

Comments: 28 pages

arXiv:2106.13756 [pdf, other]

Private Adaptive Gradient Methods for Convex Optimization

Authors: Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, Kunal Talwar

Abstract: We study adaptive methods for differentially private convex optimization, proposing and analyzing differentially private variants of a Stochastic Gradient Descent (SGD) algorithm with adaptive stepsizes, as well as the AdaGrad algorithm. We provide upper bounds on the regret of both algorithms and show that the bounds are (worst-case) optimal. As a consequence of our development, we show that our private versions of AdaGrad outperform adaptive SGD, which in turn outperforms traditional SGD in scenarios with non-isotropic gradients where (non-private) AdaGrad provably outperforms SGD. The major challenge is that the isotropic noise typically added for privacy dominates the signal in gradient geometry for high-dimensional problems; approaches to this that effectively optimize over lower-dimensional subspaces simply ignore the actual problems that varying gradient geometries introduce. In contrast, we study non-isotropic clipping and noise addition, developing a principled theoretical approach; the consequent procedures also enjoy significantly stronger empirical performance than prior approaches.

Submitted 25 June, 2021; originally announced June 2021.

Comments: To appear in 38th International Conference on Machine Learning (ICML 2021)

arXiv:2101.05234 [pdf, other]

On Misspecification in Prediction Problems and Robustness via Improper Learning

Authors: Annie Marsden, John Duchi, Gregory Valiant

Abstract: We study probabilistic prediction games when the underlying model is misspecified, investigating the consequences of predicting using an incorrect parametric model. We show that for a broad class of loss functions and parametric families of distributions, the regret of playing a "proper" predictor -- one from the putative model class -- relative to the best predictor in the same model class has lower bound scaling at least as $\sqrt{γn}$, where $γ$ is a measure of the model misspecification to the true distribution in terms of total variation distance. In contrast, using an aggregation-based (improper) learner, one can obtain regret $d \log n$ for any underlying generating distribution, where $d$ is the dimension of the parameter; we exhibit instances in which this is unimprovable even over the family of all learners that may play distributions in the convex hull of the parametric family. These results suggest that simple strategies for aggregating multiple learners together should be more robust, and several experiments conform to this hypothesis.

Submitted 29 January, 2021; v1 submitted 13 January, 2021; originally announced January 2021.

Comments: 28 pages, 6 figures

arXiv:2101.02696 [pdf, other]

Accelerated, Optimal, and Parallel: Some Results on Model-Based Stochastic Optimization

Authors: Karan Chadha, Gary Cheng, John C. Duchi

Abstract: We extend the Approximate-Proximal Point (aProx) family of model-based methods for solving stochastic convex optimization problems, including stochastic subgradient, proximal point, and bundle methods, to the minibatch and accelerated settings. To do so, we propose specific model-based algorithms and an acceleration scheme for which we provide non-asymptotic convergence guarantees, which are order-optimal in all problem-dependent constants and provide linear speedup in minibatch size, while maintaining the desirable robustness traits (e.g. to stepsize) of the aProx family. Additionally, we show improved convergence rates and matching lower bounds identifying new fundamental constants for "interpolation" problems, whose importance in statistical machine learning is growing; this, for example, gives a parallelization strategy for alternating projections. We corroborate our theoretical results with empirical testing to demonstrate the gains accurate modeling, acceleration, and minibatching provide.

Submitted 7 January, 2021; originally announced January 2021.

Comments: 24 pages, 17 figures

arXiv:2010.05893 [pdf, other]

Large-Scale Methods for Distributionally Robust Optimization

Authors: Daniel Levy, Yair Carmon, John C. Duchi, Aaron Sidford

Abstract: We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $χ^2$ divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications. For $χ^2$ uncertainty sets these are the first such guarantees in the literature, and for CVaR our guarantees scale linearly in the uncertainty level rather than quadratically as in previous work. We also provide lower bounds proving the worst-case optimality of our algorithms for CVaR and a penalized version of the $χ^2$ problem. Our primary technical contributions are novel bounds on the bias of batch robust risk estimation and the variance of a multilevel Monte Carlo gradient estimator due to [Blanchet & Glynn, 2015]. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.

Submitted 10 December, 2020; v1 submitted 12 October, 2020; originally announced October 2020.

Comments: 63 pages, NeurIPS 2020

arXiv:2008.10581 [pdf, other]

Neural Bridge Sampling for Evaluating Safety-Critical Autonomous Systems

Authors: Aman Sinha, Matthew O'Kelly, Russ Tedrake, John Duchi

Abstract: Learning-based methodologies increasingly find applications in safety-critical domains like autonomous driving and medical robotics. Due to the rare nature of dangerous events, real-world testing is prohibitively expensive and unscalable. In this work, we employ a probabilistic approach to safety evaluation in simulation, where we are concerned with computing the probability of dangerous events. We develop a novel rare-event simulation method that combines exploration, exploitation, and optimization techniques to find failure modes and estimate their rate of occurrence. We provide rigorous guarantees for the performance of our method in terms of both statistical and computational efficiency. Finally, we demonstrate the efficacy of our approach on a variety of scenarios, illustrating its usefulness as a tool for rapid sensitivity analysis and model comparison that are essential to developing and testing safety-critical autonomous systems.

Submitted 8 August, 2021; v1 submitted 24 August, 2020; originally announced August 2020.

Comments: NeurIPS 2020

arXiv:2008.04267 [pdf, other]

doi 10.1080/01621459.2023.2298037

Robust Validation: Confident Predictions Even When Distributions Shift

Authors: Maxime Cauchois, Suyash Gupta, Alnur Ali, John C. Duchi

Abstract: While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy -- coming from robust statistics and optimization -- is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an $f$-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.'s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.

Submitted 4 July, 2024; v1 submitted 10 August, 2020; originally announced August 2020.

Comments: Published in the Journal of the American Statistical Association (JASA 2024)

arXiv:2007.13982 [pdf, other]

Distributionally Robust Losses for Latent Covariate Mixtures

Authors: John Duchi, Tatsunori Hashimoto, Hongseok Namkoong

Abstract: While modern large-scale datasets often consist of heterogeneous subpopulations -- for example, multiple demographic groups or multiple text corpora -- the standard practice of minimizing average loss fails to guarantee uniformly low losses across all subpopulations. We propose a convex procedure that controls the worst-case performance over all subpopulations of a given size. Our procedure comes with finite-sample (nonparametric) convergence guarantees on the worst-off subpopulation. Empirically, we observe on lexical similarity, wine quality, and recidivism prediction tasks that our worst-case procedure learns models that do well against unseen subpopulations.

Submitted 10 August, 2022; v1 submitted 28 July, 2020; originally announced July 2020.

Comments: First released in 2019 on a personal website; published in Operations Research in 2022

arXiv:2006.13476 [pdf, other]

Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations

Authors: Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Ayush Sekhari, Karthik Sridharan

Abstract: We design an algorithm which finds an $ε$-approximate stationary point (with $\|\nabla F(x)\|\le ε$) using $O(ε^{-3})$ stochastic gradient and Hessian-vector products, matching guarantees that were previously available only under a stronger assumption of access to multiple queries with the same random seed. We prove a lower bound which establishes that this rate is optimal and---surprisingly---that it cannot be improved using stochastic $p$th order methods for any $p\ge 2$, even when the first $p$ derivatives of the objective are Lipschitz. Together, these results characterize the complexity of non-convex stochastic optimization with second-order methods and beyond. Expanding our scope to the oracle complexity of finding $(ε,γ)$-approximate second-order stationary points, we establish nearly matching upper and lower bounds for stochastic second-order methods. Our lower bounds here are novel even in the noiseless case.

Submitted 24 June, 2020; originally announced June 2020.

Comments: Accepted to Conference on Learning Theory (COLT) 2020

arXiv:2005.10630 [pdf, other]

Near Instance-Optimality in Differential Privacy

Authors: Hilal Asi, John C. Duchi

Abstract: We develop two notions of instance optimality in differential privacy, inspired by classical statistical theory: one by defining a local minimax risk and the other by considering unbiased mechanisms and analogizing the Cramér-Rao bound, and we show that the local modulus of continuity of the estimand of interest completely determines these quantities. We also develop a complementary collection of mechanisms, termed the inverse sensitivity mechanisms, which are instance optimal (or nearly instance optimal) for a large class of estimands. Moreover, these mechanisms uniformly outperform the smooth sensitivity framework on each instance for several function classes of interest, including real-valued continuous functions. We carefully present two instantiations of the mechanisms for median and robust regression estimation with corresponding experiments.

Submitted 16 May, 2020; originally announced May 2020.

arXiv:2004.10181 [pdf, other]

Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction

Authors: Maxime Cauchois, Suyash Gupta, John Duchi

Abstract: We develop conformal prediction methods for constructing valid predictive confidence sets in multiclass and multilabel problems without assumptions on the data generating distribution. A challenge here is that typical conformal prediction methods---which give marginal validity (coverage) guarantees---provide uneven coverage, in that they address easy examples at the expense of essentially ignoring difficult examples. By leveraging ideas from quantile regression, we build methods that always guarantee correct coverage but additionally provide (asymptotically optimal) conditional coverage for both multiclass and multilabel prediction problems. To address the potential challenge of exponentially large confidence sets in multilabel prediction, we build tree-structured classifiers that efficiently account for interactions between labels. Our methods can be bolted on top of any classification model---neural network, random forest, boosted tree---to guarantee its validity. We also provide an empirical evaluation, simultaneously providing new validation methods, that suggests the more robust coverage of our confidence sets.

Submitted 10 July, 2020; v1 submitted 21 April, 2020; originally announced April 2020.

Comments: Updated section on multilabel settings addressing the cases when labels may repel each other

arXiv:2003.03900 [pdf, other]

FormulaZero: Distributionally Robust Online Adaptation via Offline Population Synthesis

Authors: Aman Sinha, Matthew O'Kelly, Hongrui Zheng, Rahul Mangharam, John Duchi, Russ Tedrake

Abstract: Balancing performance and safety is crucial to deploying autonomous vehicles in multi-agent environments. In particular, autonomous racing is a domain that penalizes safe but conservative policies, highlighting the need for robust, adaptive strategies. Current approaches either make simplifying assumptions about other agents or lack robust mechanisms for online adaptation. This work makes algorithmic contributions to both challenges. First, to generate a realistic, diverse set of opponents, we develop a novel method for self-play based on replica-exchange Markov chain Monte Carlo. Second, we propose a distributionally robust bandit optimization procedure that adaptively adjusts risk aversion relative to uncertainty in beliefs about opponents' behaviors. We rigorously quantify the tradeoffs in performance and robustness when approximating these computations in real-time motion-planning, and we demonstrate our methods experimentally on autonomous vehicles that achieve scaled speeds comparable to Formula One racecars.

Submitted 22 August, 2020; v1 submitted 8 March, 2020; originally announced March 2020.

Comments: ICML 2020: https://icml.cc/virtual/2020/poster/6277

arXiv:2002.10716 [pdf, other]

Understanding and Mitigating the Tradeoff Between Robustness and Accuracy

Authors: Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, Percy Liang

Abstract: Adversarial training augments the training set with perturbations to improve the robust error (over worst-case perturbations), but it often leads to an increase in the standard error (on unperturbed test inputs). Previous explanations for this tradeoff rely on the assumption that no predictor in the hypothesis class has low standard and robust error. In this work, we precisely characterize the effect of augmentation on the standard error in linear regression when the optimal linear predictor has zero standard and robust error. In particular, we show that the standard error could increase even when the augmented perturbations have noiseless observations from the optimal linear predictor. We then prove that the recently proposed robust self-training (RST) estimator improves robust error without sacrificing standard error for noiseless linear regression. Empirically, for neural networks, we find that RST with different adversarial training methods improves both standard and robust error for random and adversarial rotations and adversarial $\ell_\infty$ perturbations in CIFAR-10.

Submitted 6 July, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

Comments: Appearing at International Conference on Machine Learning (ICML) 2020

arXiv:1912.04042 [pdf, other]

Element Level Differential Privacy: The Right Granularity of Privacy

Authors: Hilal Asi, John Duchi, Omid Javidbakht

Abstract: Differential Privacy (DP) provides strong guarantees on the risk of compromising a user's data in statistical learning applications, though these strong protections make learning challenging and may be too stringent for some use cases. To address this, we propose element level differential privacy, which extends differential privacy to provide protection against leaking information about any particular "element" a user has, allowing better utility and more robust results than classical DP. By carefully choosing these "elements," it is possible to provide privacy protections at a desired granularity. We provide definitions, associated privacy guarantees, and analysis to identify the tradeoffs with the new definition; we also develop several private estimation and learning methodologies, providing careful examples for item frequency and M-estimation (empirical risk minimization) with concomitant privacy and utility analysis. We complement our theoretical and methodological advances with several real-world applications, estimating histograms and fitting several large-scale prediction models, including deep networks.

Submitted 5 December, 2019; originally announced December 2019.

Comments: 34 pages, 5 figures

arXiv:1912.02365 [pdf, other]

Lower Bounds for Non-Convex Stochastic Optimization

Authors: Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, Blake Woodworth

Abstract: We lower bound the complexity of finding $ε$-stationary points (with gradient norm at most $ε$) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least $ε^{-4}$ queries to find an $ε$-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of $ε^{-3}$ queries, establishing the optimality of recently proposed variance reduction techniques.

Submitted 27 February, 2022; v1 submitted 4 December, 2019; originally announced December 2019.

Comments: Correction to hard instance dimensions in Theorem 3

arXiv:1909.10455 [pdf, other]

Necessary and Sufficient Geometries for Gradient Methods

Authors: Daniel Levy, John C. Duchi

Abstract: We study the impact of the constraint set and gradient geometry on the convergence of online and stochastic methods for convex optimization, providing a characterization of the geometries for which stochastic gradient and adaptive gradient methods are (minimax) optimal. In particular, we show that when the constraint set is quadratically convex, diagonally pre-conditioned stochastic gradient methods are minimax optimal. We further provide a converse that shows that when the constraints are not quadratically convex---for example, any $\ell_p$-ball for $p < 2$---the methods are far from optimal. Based on this, we can provide concrete recommendations for when one should use adaptive, mirror or stochastic gradient methods.

Submitted 28 October, 2019; v1 submitted 23 September, 2019; originally announced September 2019.

Comments: 23 pages. To appear at NeurIPS 2019

arXiv:1906.06032 [pdf, other]

Adversarial Training Can Hurt Generalization

Authors: Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C. Duchi, Percy Liang

Abstract: While adversarial training can improve robust accuracy (against an adversary), it sometimes hurts standard accuracy (when there is no adversary). Previous work has studied this tradeoff between standard and robust accuracy, but only in the setting where no predictor performs well on both objectives in the infinite data limit. In this paper, we show that even when the optimal predictor with infinite data performs well on both objectives, a tradeoff can still manifest itself with finite data. Furthermore, since our construction is based on a convex learning problem, we rule out optimization concerns, thus laying bare a fundamental tension between robustness and generalization. Finally, we show that robust self-training mostly eliminates this tradeoff by leveraging unlabeled data.

Submitted 26 August, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

arXiv:1905.13736 [pdf, other]

Unlabeled Data Improves Adversarial Robustness

Authors: Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, John C. Duchi

Abstract: We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels.

Submitted 13 January, 2022; v1 submitted 31 May, 2019; originally announced May 2019.

Comments: Corrected some math typos in the proof of Lemma 1

arXiv:1903.02675 [pdf, other]

A Rank-1 Sketch for Matrix Multiplicative Weights

Authors: Yair Carmon, John C. Duchi, Aaron Sidford, Kevin Tian

Abstract: We show that a simple randomized sketch of the matrix multiplicative weight (MMW) update enjoys (in expectation) the same regret bounds as MMW, up to a small constant factor. Unlike MMW, where every step requires full matrix exponentiation, our steps require only a single product of the form $e^A b$, which the Lanczos method approximates efficiently. Our key technique is to view the sketch as a $\textit{randomized mirror projection}$, and perform mirror descent analysis on the $\textit{expected projection}$. Our sketch solves the online eigenvector problem, improving the best known complexity bounds by $Ω(\log^5 n)$. We also apply this sketch to semidefinite programming in saddle-point form, yielding a simple primal-dual scheme with guarantees matching the best in the literature.

Submitted 12 August, 2019; v1 submitted 6 March, 2019; originally announced March 2019.

arXiv:1903.00184 [pdf, other]

Proximal algorithms for constrained composite optimization, with applications to solving low-rank SDPs

Authors: Yu Bai, John Duchi, Song Mei

Abstract: We study a family of (potentially non-convex) constrained optimization problems with convex composite structure. Through a novel analysis of non-smooth geometry, we show that proximal-type algorithms applied to exact penalty formulations of such problems exhibit local linear convergence under a quadratic growth condition, which the compositional structure we consider ensures. The main application of our results is to low-rank semidefinite optimization with Burer-Monteiro factorizations. We precisely identify the conditions for quadratic growth in the factorized problem via structures in the semidefinite problem, which could be of independent interest for understanding matrix factorization.

Submitted 1 March, 2019; originally announced March 2019.

arXiv:1901.03403 [pdf, ps, other]

doi 10.1109/TIT.2022.3174409

Mean Estimation from One-Bit Measurements

Authors: Alon Kipnis, John C. Duchi

Abstract: We consider the problem of estimating the mean of a symmetric log-concave distribution under the constraint that only a single bit per sample from this distribution is available to the estimator. We study the mean squared error as a function of the sample size (and hence the number of bits). We consider three settings: first, a centralized setting, where an encoder may release $n$ bits given a sample of size $n$, and for which there is no asymptotic penalty for quantization; second, an adaptive setting in which each bit is a function of the current observation and previously recorded bits, where we show that the optimal relative efficiency compared to the sample mean is precisely the efficiency of the median; lastly, we show that in a distributed setting where each bit is only a function of a local sample, no estimator can achieve optimal efficiency uniformly over the parameter space. We additionally complement our results in the adaptive setting by showing that \emph{one} round of adaptivity is sufficient to achieve optimal mean-square error.

Submitted 9 May, 2022; v1 submitted 10 January, 2019; originally announced January 2019.

Comments: Accepted for publication in the IEEE Transactions on Information Theory

Journal ref: IEEE Transactions on Information Theory (Volume: 68, Issue: 9, September 2022)

arXiv:1812.00984 [pdf, other]

Protection Against Reconstruction and Its Applications in Private Federated Learning

Authors: Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, Ryan Rogers

Abstract: In large-scale statistical learning, data collection and model fitting are moving increasingly toward peripheral devices---phones, watches, fitness trackers---away from centralized data collection. Concomitant with this rise in decentralized data are increasing challenges of maintaining privacy while allowing enough information to fit accurate, useful statistical models. This motivates local notions of privacy---most significantly, local differential privacy, which provides strong protections against sensitive data disclosures---where data is obfuscated before a statistician or learner can even observe it, providing strong protections to individuals' data. Yet local privacy as traditionally employed may prove too stringent for practical use, especially in modern high-dimensional statistical and machine learning problems. Consequently, we revisit the types of disclosures and adversaries against which we provide protections, considering adversaries with limited prior information and ensuring that, with high probability, they cannot reconstruct an individual's data within useful tolerances. By reconceptualizing these protections, we allow more useful data release---large privacy parameters in local differential privacy---and we design new (minimax) optimal locally differentially private mechanisms for statistical learning problems for \emph{all} privacy levels. We thus present practicable approaches to large-scale locally private model training that were previously impossible, showing theoretically and empirically that we can fit large-scale image classification and language models with little degradation in utility.

Submitted 3 June, 2019; v1 submitted 3 December, 2018; originally announced December 2018.

arXiv:1811.00145 [pdf, ps, other]

Scalable End-to-End Autonomous Vehicle Testing via Rare-event Simulation

Authors: Matthew O'Kelly, Aman Sinha, Hongseok Namkoong, John Duchi, Russ Tedrake

Abstract: While recent developments in autonomous vehicle (AV) technology highlight substantial progress, we lack tools for rigorous and scalable testing. Real-world testing, the $\textit{de facto}$ evaluation environment, places the public in danger, and, due to the rare nature of accidents, will require billions of miles in order to statistically validate performance claims. We implement a simulation framework that can test an entire modern autonomous driving system, including, in particular, systems that employ deep-learning perception and control algorithms. Using adaptive importance-sampling methods to accelerate rare-event probability evaluation, we estimate the probability of an accident under a base distribution governing standard traffic behavior. We demonstrate our framework on a highway scenario, accelerating system evaluation by $2$-$20$ times over naive Monte Carlo sampling methods and $10$-$300 \mathsf{P}$ times (where $\mathsf{P}$ is the number of processors) over real-world testing.

Submitted 12 January, 2019; v1 submitted 31 October, 2018; originally announced November 2018.

Comments: NeurIPS 2018

arXiv:1810.08750 [pdf, other]

Learning Models with Uniform Performance via Distributionally Robust Optimization

Authors: John Duchi, Hongseok Namkoong

Abstract: A common goal in statistics and machine learning is to learn models that can perform well against distributional shifts, such as latent heterogeneous subpopulations, unknown covariate shifts, or unmodeled temporal effects. We develop and analyze a distributionally robust stochastic optimization (DRO) framework that learns a model providing good performance against perturbations to the data-generating distribution. We give a convex formulation for the problem, providing several convergence guarantees. We prove finite-sample minimax upper and lower bounds, showing that distributional robustness sometimes comes at a cost in convergence rates. We give limit theorems for the learned parameters, where we fully specify the limiting distribution so that confidence intervals can be computed. On real tasks including generalizing to unknown subpopulations, fine-grained recognition, and providing good tail performance, the distributionally robust approach often exhibits improved performance.

Submitted 17 July, 2020; v1 submitted 19 October, 2018; originally announced October 2018.

arXiv:1806.05756 [pdf, other]

The Right Complexity Measure in Locally Private Estimation: It is not the Fisher Information

Authors: John C. Duchi, Feng Ruan

Abstract: We identify fundamental tradeoffs between statistical utility and privacy under local models of privacy in which data is kept private even from the statistician, providing instance-specific bounds for private estimation and learning problems by developing the \emph{local minimax risk}. In contrast to approaches based on worst-case (minimax) error, which are conservative, this allows us to evaluate the difficulty of individual problem instances and delineate the possibilities for adaptation in private estimation and inference. Our main results show that the local modulus of continuity of the estimand with respect to the variation distance---as opposed to the Hellinger distance central to classical statistics---characterizes rates of convergence under locally private estimation for many notions of privacy, including differential privacy and its relaxations. As consequences of these results, we identify an alternative to the Fisher information for private estimation, giving a more nuanced understanding of the challenges of adaptivity and optimality.

Submitted 29 September, 2020; v1 submitted 14 June, 2018; originally announced June 2018.

arXiv:1805.12018 [pdf, other]

Generalizing to Unseen Domains via Adversarial Data Augmentation

Authors: Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John Duchi, Vittorio Murino, Silvio Savarese

Abstract: We are concerned with learning models that generalize well to different \emph{unseen} domains. We consider a worst-case formulation over data distributions that are near the source domain in the feature space. Only using training data from a single source distribution, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. We show that our iterative scheme is an adaptive data augmentation method where we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers that regularize towards zero (e.g., ridge or lasso). On digit recognition and semantic segmentation tasks, our method learns models that improve performance across a range of a priori unknown target domains.

Submitted 6 November, 2018; v1 submitted 30 May, 2018; originally announced May 2018.

Comments: Accepted to NIPS 2018 (camera ready)

arXiv:1804.03761 [pdf, other]

Derivative free optimization via repeated classification

Authors: Tatsunori B. Hashimoto, Steve Yadlowsky, John C. Duchi

Abstract: We develop an algorithm for minimizing a function using $n$ batched function value measurements at each of $T$ rounds by using classifiers to identify a function's sublevel set. We show that sufficiently accurate classifiers can achieve linear convergence rates, and show that the convergence rate is tied to the difficulty of active learning sublevel sets. Further, we show that the bootstrap is a computationally efficient approximation to the necessary classification scheme. The end result is a computationally efficient derivative-free algorithm requiring no tuning that consistently outperforms other approaches on simulations, standard benchmarks, real-world DNA binding optimization, and airfoil design problems whenever batched function queries are natural.

Submitted 10 April, 2018; originally announced April 2018.

Comments: At AISTATS2018

arXiv:1710.10571 [pdf, ps, other]

Certifying Some Distributional Robustness with Principled Adversarial Training

Authors: Aman Sinha, Hongseok Namkoong, Riccardo Volpi, John Duchi

Abstract: Neural networks are vulnerable to adversarial examples and researchers have proposed many heuristic attack and defense mechanisms. We address this problem through the principled lens of distributionally robust optimization, which guarantees performance under adversarial input perturbations. By considering a Lagrangian penalty formulation of perturbing the underlying data distribution in a Wasserstein ball, we provide a training procedure that augments model parameter updates with worst-case perturbations of training data. For smooth losses, our procedure provably achieves moderate levels of robustness with little computational or statistical cost relative to empirical risk minimization. Furthermore, our statistical guarantees allow us to efficiently certify robustness for the population loss. For imperceptible perturbations, our method matches or outperforms heuristic approaches.

Submitted 1 May, 2020; v1 submitted 29 October, 2017; originally announced October 2017.

Comments: ICLR 2018: https://openreview.net/forum?id=Hk6kPgZA-

arXiv:1705.02356 [pdf, other]

Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval

Authors: John C. Duchi, Feng Ruan

Abstract: We develop procedures, based on minimization of the composition $f(x) = h(c(x))$ of a convex function $h$ and smooth function $c$, for solving random collections of quadratic equalities, applying our methodology to phase retrieval problems. We show that the prox-linear algorithm we develop can solve phase retrieval problems---even with adversarially faulty measurements---with high probability as soon as the number of measurements $m$ is a constant factor larger than the dimension $n$ of the signal to be recovered. The algorithm requires essentially no tuning---it consists of solving a sequence of convex problems---and it is implementable without any particular assumptions on the measurements taken. We provide substantial experiments investigating our methods, indicating the practical effectiveness of the procedures and showing that they succeed with high probability as soon as $m / n \ge 2$ when the signal is real-valued.

Submitted 22 April, 2018; v1 submitted 5 May, 2017; originally announced May 2017.

Comments: 55 pages, 9 figures

arXiv:1612.00547 [pdf, other]

Gradient Descent Finds the Cubic-Regularized Non-Convex Newton Step

Authors: Yair Carmon, John C. Duchi

Abstract: We consider the minimization of non-convex quadratic forms regularized by a cubic term, which exhibit multiple saddle points and poor local minima. Nonetheless, we prove that, under mild assumptions, gradient descent approximates the $\textit{global minimum}$ to within $\varepsilon$ accuracy in $O(\varepsilon^{-1}\log(1/\varepsilon))$ steps for large $\varepsilon$ and $O(\log(1/\varepsilon))$ steps for small $\varepsilon$ (compared to a condition number we define), with at most logarithmic dependence on the problem dimension. When we use gradient descent to approximate the cubic-regularized Newton step, our result implies a rate of convergence to second-order stationary points of general smooth non-convex functions.

Submitted 30 August, 2022; v1 submitted 1 December, 2016; originally announced December 2016.

Comments: Corrected Lemma 4.6(iii) and changed the title and some notation to match the journal version of the paper

arXiv:1611.00756 [pdf, other]

Accelerated Methods for Non-Convex Optimization

Authors: Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford

Abstract: We present an accelerated gradient method for non-convex optimization problems with Lipschitz continuous first and second derivatives. The method requires time $O(ε^{-7/4} \log(1/ ε) )$ to find an $ε$-stationary point, meaning a point $x$ such that $\|\nabla f(x)\| \le ε$. The method improves upon the $O(ε^{-2} )$ complexity of gradient descent and provides the additional second-order guarantee that $\nabla^2 f(x) \succeq -O(ε^{1/2})I$ for the computed $x$. Furthermore, our method is Hessian free, i.e. it only requires gradient computations, and is therefore suitable for large scale applications.

Submitted 2 February, 2017; v1 submitted 2 November, 2016; originally announced November 2016.

arXiv:1608.03100 [pdf, other]

Estimation from Indirect Supervision with Linear Moments

Authors: Aditi Raghunathan, Roy Frostig, John Duchi, Percy Liang

Abstract: In structured prediction problems where we have indirect supervision of the output, maximum marginal likelihood faces two computational obstacles: non-convexity of the objective and intractability of even a single gradient computation. In this paper, we bypass both obstacles for a class of what we call linear indirectly-supervised problems. Our approach is simple: we solve a linear system to estimate sufficient statistics of the model, which we then use to estimate parameters via convex optimization. We analyze the statistical properties of our approach and show empirically that it is effective in two settings: learning with local privacy constraints and learning from low-cost count-based annotations.

Submitted 10 August, 2016; originally announced August 2016.

Comments: 12 pages, 7 figures, extended and updated version of our paper appearing in ICML 2016

arXiv:1604.02390 [pdf, ps, other]

Minimax Optimal Procedures for Locally Private Estimation

Authors: John Duchi, Martin Wainwright, Michael Jordan

Abstract: Working under a model of privacy in which data remains private even from the statistician, we study the tradeoff between privacy guarantees and the risk of the resulting statistical estimators. We develop private versions of classical information-theoretic bounds, in particular those due to Le Cam, Fano, and Assouad. These inequalities allow for a precise characterization of statistical rates under local privacy constraints and the development of provably (minimax) optimal estimation procedures. We provide a treatment of several canonical families of problems: mean estimation and median estimation, generalized linear models, and nonparametric density estimation. For all of these families, we provide lower and upper bounds that match up to constant factors, and exhibit new (optimal) privacy-preserving mechanisms and computationally efficient estimators that achieve the bounds. Additionally, we present a variety of experimental results for estimation problems involving sensitive data, including salaries, censored blog posts and articles, and drug abuse; these experiments demonstrate the importance of deriving optimal procedures.

Submitted 14 November, 2017; v1 submitted 8 April, 2016; originally announced April 2016.

Comments: 64 pages, 8 figures. arXiv admin note: substantial text overlap with arXiv:1302.3203

arXiv:1603.00126 [pdf, ps, other]

Multiclass Classification, Information, Divergence, and Surrogate Risk

Authors: John C. Duchi, Khashayar Khosravi, Feng Ruan

Abstract: We provide a unifying view of statistical information measures, multi-way Bayesian hypothesis testing, loss functions for multi-class classification problems, and multi-distribution $f$-divergences, elaborating equivalence results between all of these objects, and extending existing results for binary outcome spaces to more general ones. We consider a generalization of $f$-divergences to multiple distributions, and we provide a constructive equivalence between divergences, statistical information (in the sense of DeGroot), and losses for multiclass classification. A major application of our results is in multi-class classification problems in which we must both infer a discriminant function $γ$---for making predictions on a label $Y$ from datum $X$---and a data representation (or, in the setting of a hypothesis testing problem, an experimental design), represented as a quantizer $\mathsf{q}$ from a family of possible quantizers $\mathsf{Q}$. In this setting, we characterize the equivalence between loss functions, meaning that optimizing either of two losses yields an optimal discriminant and quantizer $\mathsf{q}$, complementing and extending earlier results of Nguyen et al. to the multiclass case. Our results provide a more substantial basis than standard classification calibration results for comparing different losses: we describe the convex losses that are consistent for jointly choosing a data representation and minimizing the (weighted) probability of error in multiclass classification problems.

Submitted 10 September, 2017; v1 submitted 29 February, 2016; originally announced March 2016.

arXiv:1412.4451 [pdf, ps, other]

Privacy and Statistical Risk: Formalisms and Minimax Bounds

Authors: Rina Foygel Barber, John C. Duchi

Abstract: We explore and compare a variety of definitions for privacy and disclosure limitation in statistical estimation and data analysis, including (approximate) differential privacy, testing-based definitions of privacy, and posterior guarantees on disclosure risk. We give equivalence results between the definitions, shedding light on the relationships between different formalisms for privacy. We also take an inferential perspective, where---building off of these definitions---we provide minimax risk bounds for several estimation problems, including mean estimation, estimation of the support of a distribution, and nonparametric density estimation. These bounds highlight the statistical consequences of different definitions of privacy and provide a second lens for evaluating the advantages and disadvantages of different techniques for disclosure limitation.

Submitted 14 December, 2014; originally announced December 2014.

Comments: 29 pages
