-
Fine Grained Lower Bounds for Multidimensional Knapsack
Authors:
Ilan Doron-Arad,
Ariel Kulik,
Pasin Manurangsi
Abstract:
We study the $d$-dimensional knapsack problem. We are given a set of items, each with a $d$-dimensional cost vector and a profit, along with a $d$-dimensional budget vector. The goal is to select a set of items that do not exceed the budget in all dimensions and maximize the total profit. A PTAS with running time $n^{Θ(d/\varepsilon)}$ has long been known for this problem, where $\varepsilon$ is the error parameter and $n$ is the encoding size. Despite decades of active research, the best running time of a PTAS has remained $O(n^{\lceil d/\varepsilon \rceil - d})$. Unfortunately, existing lower bounds only cover the special case with two dimensions $d = 2$, and do not answer whether there is a $n^{o(d/\varepsilon)}$-time PTAS for larger values of $d$. The status of exact algorithms is similar: there is a simple $O(n \cdot W^d)$-time (exact) dynamic programming algorithm, where $W$ is the maximum budget, but there is no lower bound which explains the strong exponential dependence on $d$.
In this work, we show that the running times of the best-known PTAS and exact algorithm cannot be improved up to a polylogarithmic factor assuming Gap-ETH. Our techniques are based on a robust reduction from 2-CSP, which embeds 2-CSP constraints into a desired number of dimensions, exhibiting a tight trade-off between $d$ and $\varepsilon$ for most regimes of the parameters. Informally, we obtain the following main results for $d$-dimensional knapsack:
- No $n^{o(d/\varepsilon \cdot 1/(\log(d/\varepsilon))^2)}$-time $(1-\varepsilon)$-approximation for every $\varepsilon = O(1/\log d)$.
- No $(n+W)^{o(d/\log d)}$-time exact algorithm (assuming ETH).
- No $n^{o(\sqrt{d})}$-time $(1-\varepsilon)$-approximation for constant $\varepsilon$.
- $(d \cdot \log W)^{O(d^2)} + n^{O(1)}$-time $Ω(1/\sqrt{d})$-approximation and a matching $n^{O(1)}$-time lower bound.
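The simple $O(n \cdot W^d)$-time exact dynamic program mentioned in the abstract can be sketched as follows. This is a minimal illustration under assumptions not stated in the abstract: costs and budgets are non-negative integers, each item may be used at most once, and the function name `knapsack_dp` is hypothetical. It keeps one table entry per budget vector, so for fixed $d$ it touches about $n \cdot (W+1)^d$ states.

```python
from itertools import product


def knapsack_dp(costs, profits, budget):
    """Exact d-dimensional 0/1 knapsack (illustrative sketch).

    costs:   list of d-tuples of non-negative integers, one per item
    profits: list of non-negative numbers, one per item
    budget:  d-tuple of non-negative integers
    """
    # All budget vectors b with 0 <= b[i] <= budget[i], in lexicographic order.
    states = list(product(*(range(w + 1) for w in budget)))
    # best[b] = maximum profit with total cost at most b in every dimension.
    best = {b: 0 for b in states}
    for cost, profit in zip(costs, profits):
        # Visit states in decreasing lexicographic order so that best[prev]
        # still refers to the table before this item was considered.
        for b in reversed(states):
            prev = tuple(x - c for x, c in zip(b, cost))
            if min(prev, default=0) >= 0:
                best[b] = max(best[b], best[prev] + profit)
    return best[tuple(budget)]
```

For example, with costs `[(2, 1), (1, 2), (2, 2)]`, profits `[3, 4, 5]`, and budget `(3, 3)`, the first two items fit together and the third does not fit with either of them.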
Submitted 14 July, 2024;
originally announced July 2024.
-
On Equivalence of Parameterized Inapproximability of k-Median, k-Max-Coverage, and 2-CSP
Authors:
Karthik C. S.,
Euiwoong Lee,
Pasin Manurangsi
Abstract:
Parameterized Inapproximability Hypothesis (PIH) is a central question in the field of parameterized complexity. PIH asserts that given as input a 2-CSP on $k$ variables and alphabet size $n$, it is W[1]-hard parameterized by $k$ to distinguish if the input is perfectly satisfiable or if every assignment to the input violates 1% of the constraints.
An important implication of PIH is that it yields the tight parameterized inapproximability of the $k$-maxcoverage problem. In the $k$-maxcoverage problem, we are given as input a set system, a threshold $τ>0$, and a parameter $k$ and the goal is to determine if there exist $k$ sets in the input whose union is at least $τ$ fraction of the entire universe. PIH is known to imply that it is W[1]-hard parameterized by $k$ to distinguish if there are $k$ input sets whose union is at least $τ$ fraction of the universe or if the union of every $k$ input sets is not much larger than $τ\cdot (1-\frac{1}{e})$ fraction of the universe.
In this work we present a gap preserving FPT reduction (in the reverse direction) from the $k$-maxcoverage problem to the aforementioned 2-CSP problem, thus showing that the assertion that approximating the $k$-maxcoverage problem to some constant factor is W[1]-hard implies PIH. In addition, we present a gap preserving FPT reduction from the $k$-median problem (in general metrics) to the $k$-maxcoverage problem, further highlighting the power of gap preserving FPT reductions over classical gap preserving polynomial time reductions.
Submitted 11 July, 2024;
originally announced July 2024.
-
On Convex Optimization with Semi-Sensitive Features
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Raghu Meka,
Chiyuan Zhang
Abstract:
We study the differentially private (DP) empirical risk minimization (ERM) problem under the semi-sensitive DP setting where only some features are sensitive. This generalizes the Label DP setting where only the label is sensitive. We give improved upper and lower bounds on the excess risk for DP-ERM. In particular, we show that the error only scales polylogarithmically in terms of the sensitive domain size, improving upon previous results that scale polynomially in the sensitive domain size (Ghazi et al., 2021).
Submitted 27 June, 2024;
originally announced June 2024.
-
On Computing Pairwise Statistics with Local Differential Privacy
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Adam Sealfon
Abstract:
We study the problem of computing pairwise statistics, i.e., ones of the form $\binom{n}{2}^{-1} \sum_{i \ne j} f(x_i, x_j)$, where $x_i$ denotes the input to the $i$th user, with differential privacy (DP) in the local model. This formulation captures important metrics such as Kendall's $τ$ coefficient, Area Under Curve, Gini's mean difference, Gini's entropy, etc. We give several novel and generic algorithms for the problem, leveraging techniques from DP algorithms for linear queries.
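To make the form of a pairwise statistic concrete, here is a small non-private sketch. It is not one of the paper's algorithms and adds no noise; it only evaluates the statistic itself. It assumes the sum is taken over unordered pairs $i < j$ so that dividing by $\binom{n}{2}$ gives an average, and the names `pairwise_statistic` and `kendall_kernel` are hypothetical. With the kernel below, the statistic is Kendall's $τ$ coefficient in its tie-free form.

```python
from itertools import combinations


def pairwise_statistic(xs, f):
    """Average of f(x_i, x_j) over all unordered pairs i < j (needs n >= 2)."""
    pairs = list(combinations(xs, 2))
    return sum(f(a, b) for a, b in pairs) / len(pairs)


def kendall_kernel(p, q):
    """+1 if the two points are concordant, -1 if discordant, 0 on ties."""
    s = (p[0] - q[0]) * (p[1] - q[1])
    return (s > 0) - (s < 0)
```

Here each user input $x_i$ is assumed to be a pair of numbers, for example `pairwise_statistic([(1, 1), (2, 3), (3, 2)], kendall_kernel)`.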
Submitted 24 June, 2024;
originally announced June 2024.
-
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models
Authors:
Lynn Chua,
Badih Ghazi,
Yangsibo Huang,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Amer Sinha,
Chulin Xie,
Chiyuan Zhang
Abstract:
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, effectively being crosslingual? This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz) contexts. We observe that simple inference-time mitigation methods offer only limited improvement. On the other hand, we propose fine-tuning of LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is publicly available at https://github.com/google-research/crosslingual-knowledge-barriers.
Submitted 23 June, 2024;
originally announced June 2024.
-
Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning
Authors:
Lynn Chua,
Badih Ghazi,
Yangsibo Huang,
Pritish Kamath,
Ravi Kumar,
Daogao Liu,
Pasin Manurangsi,
Amer Sinha,
Chiyuan Zhang
Abstract:
Large language models (LLMs) have emerged as powerful tools for tackling complex tasks across diverse domains, but they also raise privacy concerns when fine-tuned on sensitive data due to potential memorization. While differential privacy (DP) offers a promising solution by ensuring models are 'almost indistinguishable' with or without any particular privacy unit, current evaluations on LLMs mostly treat each example (text record) as the privacy unit. This leads to uneven user privacy guarantees when contributions per user vary. We therefore study user-level DP motivated by applications where it is necessary to ensure uniform privacy protection across users. We present a systematic evaluation of user-level DP for LLM fine-tuning on natural language generation tasks. Focusing on two mechanisms for achieving user-level DP guarantees, Group Privacy and User-wise DP-SGD, we investigate design choices like data selection strategies and parameter tuning for the best privacy-utility tradeoff.
Submitted 16 August, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Individualized Privacy Accounting via Subsampling with Applications in Combinatorial Optimization
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Adam Sealfon
Abstract:
In this work, we give a new technique for analyzing individualized privacy accounting via the following simple observation: if an algorithm is one-sided add-DP, then its subsampled variant satisfies two-sided DP. From this, we obtain several improved algorithms for private combinatorial optimization problems, including decomposable submodular maximization and set cover. Our error guarantees are asymptotically tight and our algorithm satisfies pure-DP while previously known algorithms (Gupta et al., 2010; Chaturvedi et al., 2021) are approximate-DP. We also show an application of our technique beyond combinatorial optimization by giving a pure-DP algorithm for the shifting heavy hitter problem in a stream; previously, only an approximate-DP algorithm was known (Kaplan et al., 2021; Cohen & Lyu, 2023).
Submitted 28 May, 2024;
originally announced May 2024.
-
Complexity of Round-Robin Allocation with Potentially Noisy Queries
Authors:
Zihan Li,
Pasin Manurangsi,
Jonathan Scarlett,
Warut Suksompong
Abstract:
We study the complexity of a fundamental algorithm for fairly allocating indivisible items, the round-robin algorithm. For $n$ agents and $m$ items, we show that the algorithm can be implemented in time $O(nm\log(m/n))$ in the worst case. If the agents' preferences are uniformly random, we establish an improved (expected) running time of $O(nm + m\log m)$. On the other hand, assuming comparison queries between items, we prove that $Ω(nm + m\log m)$ queries are necessary to implement the algorithm, even when randomization is allowed. We also derive bounds in noise models where the answers to queries are incorrect with some probability. Our proofs involve novel applications of tools from multi-armed bandits, information theory, as well as posets and linear extensions.
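For readers unfamiliar with the round-robin algorithm, the following is a minimal sketch of the allocation rule itself: agents take turns in a fixed order, and each picks their most preferred remaining item. It assumes every agent's full ranking of the items is already available as a list, which sidesteps the query complexity the paper studies, so it is not the implementation analyzed there. The function name `round_robin` and the input format are hypothetical.

```python
def round_robin(rankings):
    """Allocate m items among n agents by round-robin picking.

    rankings[a] is agent a's list of all item ids, most preferred first.
    Returns one bundle (list of item ids, in picking order) per agent.
    """
    n = len(rankings)
    m = len(rankings[0])
    taken = set()
    pointer = [0] * n  # next position to inspect in each agent's ranking
    bundles = [[] for _ in range(n)]
    for turn in range(m):
        agent = turn % n
        # Skip items that earlier picks have already removed.
        while rankings[agent][pointer[agent]] in taken:
            pointer[agent] += 1
        item = rankings[agent][pointer[agent]]
        taken.add(item)
        bundles[agent].append(item)
    return bundles
```

With two agents whose rankings are `[0, 1, 2, 3]` and `[0, 2, 1, 3]`, agent 0 picks item 0, agent 1 picks item 2, and so on.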
Submitted 30 April, 2024;
originally announced April 2024.
-
Ordinal Maximin Guarantees for Group Fair Division
Authors:
Pasin Manurangsi,
Warut Suksompong
Abstract:
We investigate fairness in the allocation of indivisible items among groups of agents using the notion of maximin share (MMS). While previous work has shown that no nontrivial multiplicative MMS approximation can be guaranteed in this setting for general group sizes, we demonstrate that ordinal relaxations are much more useful. For example, we show that if $n$ agents are distributed equally across $g$ groups, there exists a $1$-out-of-$k$ MMS allocation for $k = O(g\log(n/g))$, while if all but a constant number of agents are in the same group, we obtain $k = O(\log n/\log \log n)$. We also establish the tightness of these bounds and provide non-asymptotic results for the case of two groups.
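As a concrete reading of the $1$-out-of-$k$ maximin share, the sketch below computes it by brute force for a single agent. It assumes the standard definition (the best value the agent can guarantee by splitting the items into $k$ bundles and receiving the least valuable one) together with additive, non-negative valuations; neither assumption is spelled out in the abstract, and the function name `one_out_of_k_mms` is hypothetical. The running time is exponential in the number of items, so it is only meant for tiny examples.

```python
from itertools import product


def one_out_of_k_mms(values, k):
    """1-out-of-k maximin share of one agent with additive item values."""
    best = 0
    # Try every way of assigning each item to one of the k bundles.
    for assignment in product(range(k), repeat=len(values)):
        bundle_totals = [0] * k
        for value, bundle in zip(values, assignment):
            bundle_totals[bundle] += value
        # The agent is assumed to receive the least valuable bundle.
        best = max(best, min(bundle_totals))
    return best
```

For item values `[4, 3, 3, 2]`, splitting into two bundles can guarantee 6, while splitting into three bundles can guarantee only 3, which illustrates how increasing $k$ relaxes the guarantee.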
Submitted 17 April, 2024;
originally announced April 2024.
-
Differentially Private Optimization with Sparse Gradients
Authors:
Badih Ghazi,
Cristóbal Guzmán,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi
Abstract:
Motivated by applications of large embedding models, we study differentially private (DP) optimization problems under sparsity of individual gradients. We start with new near-optimal bounds for the classic mean estimation problem but with sparse data, improving upon existing algorithms particularly for the high-dimensional regime. Building on this, we obtain pure- and approximate-DP algorithms with almost optimal rates for stochastic convex optimization with sparse gradients; the former represents the first nearly dimension-independent rates for this problem. Finally, we study the approximation of stationary points for the empirical loss in approximate-DP optimization and obtain rates that depend on sparsity instead of dimension, modulo polylogarithmic factors.
Submitted 16 April, 2024;
originally announced April 2024.
-
How Private are DP-SGD Implementations?
Authors:
Lynn Chua,
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Amer Sinha,
Chiyuan Zhang
Abstract:
We demonstrate a substantial gap between the privacy guarantees of the Adaptive Batch Linear Queries (ABLQ) mechanism under different types of batch sampling: (i) Shuffling, and (ii) Poisson subsampling; the typical analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) follows by interpreting it as a post-processing of ABLQ. While shuffling-based DP-SGD is more commonly used in practical implementations, it has not been amenable to easy privacy analysis, either analytically or even numerically. On the other hand, Poisson subsampling-based DP-SGD is challenging to scalably implement, but has a well-understood privacy analysis, with multiple open-source numerically tight privacy accountants available. This has led to a common practice of training with shuffling-based DP-SGD but reporting the privacy analysis for the corresponding Poisson subsampling version. Our result shows that there can be a substantial gap between the privacy analysis when using the two types of batch sampling, and thus advises caution in reporting privacy parameters for DP-SGD.
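The two batch sampling schemes contrasted above can be sketched as follows. This only illustrates how batches are formed; it contains no gradient clipping, noise addition, or privacy accounting, and it is not the paper's code. The function names and the choice of sampling probability `batch_size / n` for the Poisson variant are assumptions made for illustration.

```python
import random


def shuffle_batches(n, batch_size, rng):
    """Shuffling: permute all n examples once, then cut into fixed-size batches."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n, batch_size)]


def poisson_batches(n, batch_size, num_batches, rng):
    """Poisson subsampling: each example joins each batch independently."""
    q = batch_size / n  # inclusion probability, so the expected batch size matches
    return [[i for i in range(n) if rng.random() < q] for _ in range(num_batches)]
```

Under shuffling, every example appears exactly once per pass and batch sizes are fixed; under Poisson subsampling, batch sizes vary and an example may appear in several batches or in none.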
Submitted 6 June, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
Differentially Private Ad Conversion Measurement
Authors:
John Delaney,
Badih Ghazi,
Charlie Harrison,
Christina Ilvento,
Ravi Kumar,
Pasin Manurangsi,
Martin Pal,
Karthik Prabhakar,
Mariana Raykova
Abstract:
In this work, we study ad conversion measurement, a central functionality in digital advertising, where an advertiser seeks to estimate advertiser website (or mobile app) conversions attributed to ad impressions that users have interacted with on various publisher websites (or mobile apps). Using differential privacy (DP), a notion that has gained in popularity due to its strong mathematical guarantees, we develop a formal framework for private ad conversion measurement. In particular, we define the notion of an operationally valid configuration of the attribution rule, DP adjacency relation, contribution bounding scope and enforcement point. We then provide, for the set of configurations that most commonly arises in practice, a complete characterization, which uncovers a delicate interplay between attribution and privacy.
Submitted 22 March, 2024;
originally announced March 2024.
-
Improved FPT Approximation Scheme and Approximate Kernel for Biclique-Free Max k-Weight SAT: Greedy Strikes Back
Authors:
Pasin Manurangsi
Abstract:
In the Max $k$-Weight SAT (aka Max SAT with Cardinality Constraint) problem, we are given a CNF formula with $n$ variables and $m$ clauses together with a positive integer $k$. The goal is to find an assignment where at most $k$ variables are set to one that satisfies as many constraints as possible. Recently, Jain et al. [SODA'23] gave an FPT approximation scheme (FPT-AS) with running time $2^{O\left(\left(dk/ε\right)^d\right)} \cdot (n + m)^{O(1)}$ for Max $k$-Weight SAT when the incidence graph is $K_{d,d}$-free. They asked whether a polynomial-size approximate kernel exists. In this work, we answer this question positively by giving a $(1 - ε)$-approximate kernel with $\left(\frac{d k}{ε}\right)^{O(d)}$ variables. This also implies an improved FPT-AS with running time $(dk/ε)^{O(dk)} \cdot (n + m)^{O(1)}$. Our approximate kernel is based mainly on a couple of greedy strategies together with a sunflower lemma-style reduction rule.
Submitted 4 June, 2024; v1 submitted 10 March, 2024;
originally announced March 2024.
-
Improved Lower Bound for Differentially Private Facility Location
Authors:
Pasin Manurangsi
Abstract:
We consider the differentially private (DP) facility location problem in the so-called super-set output setting proposed by Gupta et al. [SODA 2010]. The current best known expected approximation ratio for an $ε$-DP algorithm is $O\left(\frac{\log n}{\sqrt{ε}}\right)$ due to Cohen-Addad et al. [AISTATS 2022], where $n$ denotes the size of the metric space, while the best known lower bound is $Ω(1/\sqrt{ε})$ [NeurIPS 2019].
In this short note, we give a lower bound of $\tildeΩ\left(\min\left\{\log n, \sqrt{\frac{\log n}ε}\right\}\right)$ on the expected approximation ratio of any $ε$-DP algorithm, which is the first evidence that the approximation ratio has to grow with the size of the metric space.
Submitted 7 March, 2024;
originally announced March 2024.
-
Training Differentially Private Ad Prediction Models with Semi-Sensitive Features
Authors:
Lynn Chua,
Qiliang Cui,
Badih Ghazi,
Charlie Harrison,
Pritish Kamath,
Walid Krichene,
Ravi Kumar,
Pasin Manurangsi,
Krishna Giri Narra,
Amer Sinha,
Avinash Varadarajan,
Chiyuan Zhang
Abstract:
Motivated by problems arising in digital advertising, we introduce the task of training differentially private (DP) machine learning models with semi-sensitive features. In this setting, a subset of the features is known to the attacker (and thus need not be protected) while the remaining features as well as the label are unknown to the attacker and should be protected by the DP guarantee. This task interpolates between training the model with full DP (where the label and all features should be protected) or with label DP (where all the features are considered known, and only the label should be protected). We present a new algorithm for training DP models with semi-sensitive features. Through an empirical evaluation on real ads datasets, we demonstrate that our algorithm surpasses in utility the baselines of (i) DP stochastic gradient descent (DP-SGD) run on all features (known and unknown), and (ii) a label DP algorithm run only on the known features (while discarding the unknown ones).
Submitted 26 January, 2024;
originally announced January 2024.
-
On Inapproximability of Reconfiguration Problems: PSPACE-Hardness and some Tight NP-Hardness Results
Authors:
Karthik C. S.,
Pasin Manurangsi
Abstract:
The field of combinatorial reconfiguration studies search problems with a focus on transforming one feasible solution into another. Recently, Ohsaka [STACS'23] put forth the Reconfiguration Inapproximability Hypothesis (RIH), which roughly asserts that for some $ε>0$, given as input a $k$-CSP instance (for some constant $k$) over some constant sized alphabet, and two satisfying assignments $ψ_s$ and $ψ_t$, it is PSPACE-hard to find a sequence of assignments starting from $ψ_s$ and ending at $ψ_t$ such that every assignment in the sequence satisfies at least $(1-ε)$ fraction of the constraints and also that every assignment in the sequence is obtained by changing its immediately preceding assignment (in the sequence) on exactly one variable. Assuming RIH, many important reconfiguration problems have been shown to be PSPACE-hard to approximate by Ohsaka [STACS'23; SODA'24].
In this paper, we prove RIH and establish the first (constant factor) PSPACE-hardness of approximation results for many reconfiguration problems, resolving an open question posed by Ito et al. [TCS'11]. Our proof uses known constructions of Probabilistically Checkable Proofs of Proximity (in a black-box manner) to create the gap. Independently, Hirahara and Ohsaka [STOC'24] have also proved RIH.
We also prove that the aforementioned $k$-CSP Reconfiguration problem is NP-hard to approximate to within a factor of $1/2 + ε$ (for any $ε>0$) when $k=2$. We complement this with a $(1/2 - ε)$-approximation polynomial time algorithm, which improves upon a $(1/4 - ε)$-approximation algorithm of Ohsaka [2023] (again for any $ε>0$). Finally, we show that Set Cover Reconfiguration is NP-hard to approximate to within a factor of $2 - ε$ for any constant $ε> 0$, which matches the simple linear-time 2-approximation algorithm by Ito et al. [TCS'11].
Submitted 15 February, 2024; v1 submitted 28 December, 2023;
originally announced December 2023.
-
Optimal Unbiased Randomizers for Regression with Label Differential Privacy
Authors:
Ashwinkumar Badanidiyuru,
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Ethan Leeman,
Pasin Manurangsi,
Avinash V Varadarajan,
Chiyuan Zhang
Abstract:
We propose a new family of label randomizers for training regression models under the constraint of label differential privacy (DP). In particular, we leverage the trade-offs between bias and variance to construct better label randomizers depending on a privately estimated prior distribution over the labels. We demonstrate that these randomizers achieve state-of-the-art privacy-utility trade-offs on several datasets, highlighting the importance of reducing bias when training neural networks with label DP. We also provide theoretical results shedding light on the structural properties of the optimal unbiased randomizers.
Submitted 9 December, 2023;
originally announced December 2023.
-
Summary Reports Optimization in the Privacy Sandbox Attribution Reporting API
Authors:
Hidayet Aksu,
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Adam Sealfon,
Avinash V Varadarajan
Abstract:
The Privacy Sandbox Attribution Reporting API has been recently deployed by Google Chrome to support the basic advertising functionality of attribution reporting (aka conversion measurement) after deprecation of third-party cookies. The API implements a collection of privacy-enhancing guardrails including contribution bounding and noise injection. It also offers flexibility for the analyst to allocate the contribution budget.
In this work, we present methods for optimizing the allocation of the contribution budget for summary reports from the Attribution Reporting API. We evaluate them on real-world datasets as well as on a synthetic data model that we find to accurately capture real-world conversion data. Our results demonstrate that optimizing the parameters that can be set by the analyst can significantly improve the utility achieved by querying the API while satisfying the same privacy bounds.
Submitted 22 November, 2023;
originally announced November 2023.
-
Sparsity-Preserving Differentially Private Training of Large Embedding Models
Authors:
Badih Ghazi,
Yangsibo Huang,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Amer Sinha,
Chiyuan Zhang
Abstract:
As the use of large embedding models in recommendation systems and language applications increases, concerns over user data privacy have also risen. DP-SGD, a training algorithm that combines differential privacy with stochastic gradient descent, has been the workhorse in protecting user privacy without compromising model accuracy by much. However, applying DP-SGD naively to embedding models can destroy gradient sparsity, leading to reduced training efficiency. To address this issue, we present two new algorithms, DP-FEST and DP-AdaFEST, that preserve gradient sparsity during private training of large embedding models. Our algorithms achieve substantial reductions ($10^6 \times$) in gradient size, while maintaining comparable levels of accuracy, on benchmark real-world datasets.
Submitted 14 November, 2023;
originally announced November 2023.
-
User-Level Differential Privacy With Few Examples Per User
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Raghu Meka,
Chiyuan Zhang
Abstract:
Previous work on user-level differential privacy (DP) [Ghazi et al. NeurIPS 2021, Bun et al. STOC 2023] obtained generic algorithms that work for various learning tasks. However, their focus was on the example-rich regime, where the users have so many examples that each user could themselves solve the problem. In this work we consider the example-scarce regime, where each user has only a few examples, and obtain the following results:
1. For approximate-DP, we give a generic transformation of any item-level DP algorithm to a user-level DP algorithm. Roughly speaking, the latter gives a (multiplicative) savings of $O_{\varepsilon,δ}(\sqrt{m})$ in terms of the number of users required for achieving the same utility, where $m$ is the number of examples per user. This algorithm, while recovering most known bounds for specific problems, also gives new bounds, e.g., for PAC learning.
2. For pure-DP, we present a simple technique for adapting the exponential mechanism [McSherry, Talwar FOCS 2007] to the user-level setting. This gives new bounds for a variety of tasks, such as private PAC learning, hypothesis selection, and distribution learning. For some of these problems, we show that our bounds are near-optimal.
Submitted 21 September, 2023;
originally announced September 2023.
-
Hardness of Approximating Bounded-Degree Max 2-CSP and Independent Set on k-Claw-Free Graphs
Authors:
Euiwoong Lee,
Pasin Manurangsi
Abstract:
We consider the question of approximating Max 2-CSP where each variable appears in at most $d$ constraints (but with possibly arbitrarily large alphabet). There is a simple $(\frac{d+1}{2})$-approximation algorithm for the problem. We prove the following results for any sufficiently large $d$:
- Assuming the Unique Games Conjecture (UGC), it is NP-hard (under randomized reduction) to approximate this problem to within a factor of $\left(\frac{d}{2} - o(d)\right)$.
- It is NP-hard (under randomized reduction) to approximate the problem to within a factor of $\left(\frac{d}{3} - o(d)\right)$.
Thanks to a known connection [Dvorak et al., Algorithmica 2023], we establish the following hardness results for approximating Maximum Independent Set on $k$-claw-free graphs:
- Assuming the Unique Games Conjecture (UGC), it is NP-hard (under randomized reduction) to approximate this problem to within a factor of $\left(\frac{k}{4} - o(k)\right)$.
- It is NP-hard (under randomized reduction) to approximate the problem to within a factor of $\left(\frac{k}{3 + 2\sqrt{2}} - o(k)\right) \geq \left(\frac{k}{5.829} - o(k)\right)$.
In comparison, known approximation algorithms achieve $\left(\frac{k}{2} - o(k)\right)$-approximation in polynomial time [Neuwohner, STACS 2021; Thiery and Ward, SODA 2023] and $(\frac{k}{3} + o(k))$-approximation in quasi-polynomial time [Cygan et al., SODA 2013].
Submitted 7 September, 2023;
originally announced September 2023.
-
Differentially Private Aggregation via Imperfect Shuffling
Authors:
Badih Ghazi,
Ravi Kumar,
Pasin Manurangsi,
Jelani Nelson,
Samson Zhou
Abstract:
In this paper, we introduce the imperfect shuffle differential privacy model, where messages sent from users are shuffled in an almost uniform manner before being observed by a curator for private aggregation. We then consider the private summation problem. We show that the standard split-and-mix protocol by Ishai et al. [FOCS 2006] can be adapted to achieve near-optimal utility bounds in the imperfect shuffle model. Specifically, we show that, surprisingly, there is no additional error overhead necessary in the imperfect shuffle model.
Submitted 28 August, 2023;
originally announced August 2023.
-
Optimizing Hierarchical Queries for the Attribution Reporting API
Authors:
Matthew Dawson,
Badih Ghazi,
Pritish Kamath,
Kapil Kumar,
Ravi Kumar,
Bo Luan,
Pasin Manurangsi,
Nishanth Mundru,
Harikesh Nair,
Adam Sealfon,
Shengyu Zhu
Abstract:
We study the task of performing hierarchical queries based on summary reports from the {\em Attribution Reporting API} for ad conversion measurement. We demonstrate that methods from optimization and differential privacy can help cope with the noise introduced by privacy guardrails in the API. In particular, we present algorithms for (i) denoising the API outputs and ensuring consistency across different levels of the tree, and (ii) optimizing the privacy budget across different levels of the tree. We provide an experimental evaluation of the proposed algorithms on public datasets.
Submitted 27 November, 2023; v1 submitted 25 August, 2023;
originally announced August 2023.
-
A Note on Hardness of Computing Recursive Teaching Dimension
Authors:
Pasin Manurangsi
Abstract:
In this short note, we show that the problem of computing the recursive teaching dimension (RTD) for a concept class (given explicitly as input) requires $n^{Ω(\log n)}$-time, assuming the exponential time hypothesis (ETH). This matches the running time $n^{O(\log n)}$ of the brute-force algorithm for the problem.
Submitted 19 July, 2023;
originally announced July 2023.
-
Ticketed Learning-Unlearning Schemes
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Ayush Sekhari,
Chiyuan Zhang
Abstract:
We consider the learning--unlearning paradigm defined as follows. First, given a dataset, the goal is to learn a good predictor, such as one minimizing a certain loss. Subsequently, given any subset of examples that wish to be unlearnt, the goal is to learn, without the knowledge of the original training dataset, a good predictor that is identical to the predictor that would have been produced when learning from scratch on the surviving examples.
We propose a new ticketed model for learning--unlearning wherein the learning algorithm can send back additional information in the form of a small-sized (encrypted) ``ticket'' to each participating training example, in addition to retaining a small amount of ``central'' information for later. Subsequently, the examples that wish to be unlearnt present their tickets to the unlearning algorithm, which additionally uses the central information to return a new predictor. We provide space-efficient ticketed learning--unlearning schemes for a broad family of concept classes, including thresholds, parities, intersection-closed classes, among others.
En route, we introduce the count-to-zero problem, where during unlearning, the goal is to simply know if there are any examples that survived. We give a ticketed learning--unlearning scheme for this problem that relies on the construction of Sperner families with certain properties, which might be of independent interest.
Submitted 27 June, 2023;
originally announced June 2023.
-
Differentially Private Data Release over Multiple Tables
Authors:
Badih Ghazi,
Xiao Hu,
Ravi Kumar,
Pasin Manurangsi
Abstract:
We study synthetic data release for answering multiple linear queries over a set of database tables in a differentially private way. Two special cases have been considered in the literature: how to release a synthetic dataset for answering multiple linear queries over a single table, and how to release the answer for a single counting (join size) query over a set of database tables. Compared to the single-table case, the join operator makes query answering challenging, since the sensitivity (i.e., by how much an individual data record can affect the answer) could be heavily amplified by complex join relationships.
We present an algorithm for the general problem, and prove a lower bound illustrating that our general algorithm achieves parameterized optimality (up to logarithmic factors) on some simple queries (e.g., two-table join queries) in the most commonly-used privacy parameter regimes. For the case of hierarchical joins, we present a data partition procedure that exploits the concept of {\em uniformized sensitivities} to further improve the utility.
Submitted 27 June, 2023;
originally announced June 2023.
-
On Differentially Private Sampling from Gaussian and Product Distributions
Authors:
Badih Ghazi,
Xiao Hu,
Ravi Kumar,
Pasin Manurangsi
Abstract:
Given a dataset of $n$ i.i.d. samples from an unknown distribution $P$, we consider the problem of generating a sample from a distribution that is close to $P$ in total variation distance, under the constraint of differential privacy (DP). We study the problem when $P$ is a multi-dimensional Gaussian distribution, under different assumptions on the information available to the DP mechanism: known covariance, unknown bounded covariance, and unknown unbounded covariance. We present new DP sampling algorithms, and show that they achieve near-optimal sample complexity in the first two settings. Moreover, when $P$ is a product distribution on the binary hypercube, we obtain a pure-DP algorithm whereas only an approximate-DP algorithm (with slightly worse sample complexity) was previously known.
Submitted 21 June, 2023;
originally announced June 2023.
-
Pure-DP Aggregation in the Shuffle Model: Error-Optimal and Communication-Efficient
Authors:
Badih Ghazi,
Ravi Kumar,
Pasin Manurangsi
Abstract:
We obtain a new protocol for binary counting in the $\varepsilon$-shuffle-DP model with error $O(1/\varepsilon)$ and expected communication $\tilde{O}\left(\frac{\log n}{\varepsilon}\right)$ messages per user. Previous protocols incur either an error of $O(1/\varepsilon^{1.5})$ with $O_\varepsilon(\log{n})$ messages per user (Ghazi et al., ITC 2020) or an error of $O(1/\varepsilon)$ with $O_\varepsilon(n^{2.5})$ messages per user (Cheu and Yan, TPDP 2022). Using the new protocol, we obtain improved $\varepsilon$-shuffle-DP protocols for real summation and histograms.
Submitted 28 May, 2023;
originally announced May 2023.
-
On User-Level Private Convex Optimization
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Raghu Meka,
Pasin Manurangsi,
Chiyuan Zhang
Abstract:
We introduce a new mechanism for stochastic convex optimization (SCO) with user-level differential privacy guarantees. The convergence rates of this mechanism are similar to those in the prior work of Levy et al. (2021); Narayanan et al. (2022), but with two important improvements. Our mechanism does not require any smoothness assumptions on the loss. Furthermore, our bounds are also the first where the minimum number of users needed for user-level privacy has no dependence on the dimension and only a logarithmic dependence on the desired excess error. The main idea underlying the new mechanism is to show that the optimizers of strongly convex losses have low local deletion sensitivity, along with an output perturbation method for functions with low local deletion sensitivity, which could be of independent interest.
Submitted 8 May, 2023;
originally announced May 2023.
-
On Maximum Bipartite Matching with Separation
Authors:
Pasin Manurangsi,
Erel Segal-Halevi,
Warut Suksompong
Abstract:
Maximum bipartite matching is a fundamental algorithmic problem which can be solved in polynomial time. We consider a natural variant in which there is a separation constraint: the vertices on one side lie on a path or a grid, and two vertices that are close to each other are not allowed to be matched simultaneously. We show that the problem is hard to approximate even for paths, and provide constant-factor approximation algorithms for both paths and grids.
Submitted 3 March, 2023;
originally announced March 2023.
-
Towards Separating Computational and Statistical Differential Privacy
Authors:
Badih Ghazi,
Rahul Ilango,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi
Abstract:
Computational differential privacy (CDP) is a natural relaxation of the standard notion of (statistical) differential privacy (SDP) proposed by Beimel, Nissim, and Omri (CRYPTO 2008) and Mironov, Pandey, Reingold, and Vadhan (CRYPTO 2009). In contrast to SDP, CDP only requires privacy guarantees to hold against computationally-bounded adversaries rather than computationally-unbounded statistical adversaries. Despite the question being raised explicitly in several works (e.g., Bun, Chen, and Vadhan, TCC 2016), it has remained tantalizingly open whether there is any task achievable with the CDP notion but not the SDP notion. Even a candidate such task is unknown. Indeed, it is even unclear what the truth could be!
In this work, we give the first construction of a task achievable with the CDP notion but not the SDP notion, under the following strong but plausible cryptographic assumptions: (1) Non-Interactive Witness Indistinguishable Proofs, (2) Laconic Collision-Resistant Keyless Hash Functions, (3) Differing-Inputs Obfuscation for Public-Coin Samplers. In particular, we construct a task for which there exists an $\varepsilon$-CDP mechanism with $\varepsilon = O(1)$ achieving $1-o(1)$ utility, but any $(\varepsilon, δ)$-SDP mechanism, including computationally-unbounded ones, that achieves a constant utility must use either a super-constant $\varepsilon$ or an inverse-polynomially large $δ$.
To prove this, we introduce a new approach for showing that a mechanism satisfies CDP: first we show that a mechanism is "private" against a certain class of decision tree adversaries, and then we use cryptographic constructions to "lift" this into privacy against computationally bounded adversaries. We believe this approach could be useful to devise further tasks separating CDP from SDP.
Submitted 23 October, 2023; v1 submitted 30 December, 2022;
originally announced January 2023.
-
On Differentially Private Counting on Trees
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Kewen Wu
Abstract:
We study the problem of performing counting queries at different levels in hierarchical structures while preserving individuals' privacy. Motivated by applications, we propose a new error measure for this problem by considering a combination of multiplicative and additive approximation to the query results. We examine known mechanisms in differential privacy (DP) and prove their optimality, under this measure, in the pure-DP setting. In the approximate-DP setting, we design new algorithms achieving significant improvements over known ones.
Submitted 26 April, 2023; v1 submitted 22 December, 2022;
originally announced December 2022.
-
Regression with Label Differential Privacy
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Ethan Leeman,
Pasin Manurangsi,
Avinash V Varadarajan,
Chiyuan Zhang
Abstract:
We study the task of training regression models with the guarantee of label differential privacy (DP). Based on a global prior distribution on label values, which could be obtained privately, we derive a label DP randomization mechanism that is optimal under a given regression loss function. We prove that the optimal mechanism takes the form of a "randomized response on bins", and propose an efficient algorithm for finding the optimal bin values. We carry out a thorough experimental evaluation on several datasets demonstrating the efficacy of our algorithm.
Submitted 4 October, 2023; v1 submitted 12 December, 2022;
originally announced December 2022.
-
Differentially Private Heatmaps
Authors:
Badih Ghazi,
Junfeng He,
Kai Kohlhoff,
Ravi Kumar,
Pasin Manurangsi,
Vidhya Navalpakkam,
Nachiappan Valliappan
Abstract:
We consider the task of producing heatmaps from users' aggregated data while protecting their privacy. We give a differentially private (DP) algorithm for this task and demonstrate its advantages over previous algorithms on real-world datasets.
Our core algorithmic primitive is a DP procedure that takes in a set of distributions and produces an output that is close in Earth Mover's Distance to the average of the inputs. We prove theoretical bounds on the error of our algorithm under a certain sparsity assumption and that these are near-optimal.
Submitted 24 November, 2022;
originally announced November 2022.
-
Differentially Private Fair Division
Authors:
Pasin Manurangsi,
Warut Suksompong
Abstract:
Fairness and privacy are two important concerns in social decision-making processes such as resource allocation. We study privacy in the fair allocation of indivisible resources using the well-established framework of differential privacy. We present algorithms for approximate envy-freeness and proportionality when two instances are considered to be adjacent if they differ only on the utility of a single agent for a single item. On the other hand, we provide strong negative results for both fairness criteria when the adjacency notion allows the entire utility function of a single agent to change.
Submitted 23 November, 2022;
originally announced November 2022.
-
Private Ad Modeling with DP-SGD
Authors:
Carson Denison,
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi,
Krishna Giri Narra,
Amer Sinha,
Avinash V Varadarajan,
Chiyuan Zhang
Abstract:
A well-known algorithm in privacy-preserving ML is differentially private stochastic gradient descent (DP-SGD). While this algorithm has been evaluated on text and image data, it has not been previously applied to ads data, which are notorious for their high class imbalance and sparse gradient updates. In this work we apply DP-SGD to several ad modeling tasks including predicting click-through rates, conversion rates, and number of conversion events, and evaluate their privacy-utility trade-off on real-world datasets. Our work is the first to empirically demonstrate that DP-SGD can provide both privacy and utility for ad modeling tasks.
Submitted 4 October, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Private Counting of Distinct and k-Occurring Items in Time Windows
Authors:
Badih Ghazi,
Ravi Kumar,
Pasin Manurangsi,
Jelani Nelson
Abstract:
In this work, we study the task of estimating the numbers of distinct and $k$-occurring items in a time window under the constraint of differential privacy (DP). We consider several variants depending on whether the queries are on general time windows (between times $t_1$ and $t_2$), or are restricted to being cumulative (between times $1$ and $t_2$), and depending on whether the DP neighboring relation is event-level or the more stringent item-level. We obtain nearly tight upper and lower bounds on the errors of DP algorithms for these problems. En route, we obtain an event-level DP algorithm for estimating, at each time step, the number of distinct items seen over the last $W$ updates with error polylogarithmic in $W$; this answers an open question of Bolot et al. (ICDT 2013).
Submitted 21 November, 2022;
originally announced November 2022.
-
Improved Inapproximability of VC Dimension and Littlestone's Dimension via (Unbalanced) Biclique
Authors:
Pasin Manurangsi
Abstract:
We study the complexity of computing (and approximating) VC Dimension and Littlestone's Dimension when we are given the concept class explicitly. We give a simple reduction from Maximum (Unbalanced) Biclique problem to approximating VC Dimension and Littlestone's Dimension. With this connection, we derive a range of hardness of approximation results and running time lower bounds. For example, under the (randomized) Gap-Exponential Time Hypothesis or the Strongish Planted Clique Hypothesis, we show a tight inapproximability result: both dimensions are hard to approximate to within a factor of $o(\log n)$ in polynomial-time. These improve upon constant-factor inapproximability results from [Manurangsi and Rubinstein, COLT 2017].
Submitted 2 November, 2022;
originally announced November 2022.
-
Anonymized Histograms in Intermediate Privacy Models
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi
Abstract:
We study the problem of privately computing the anonymized histogram (a.k.a. unattributed histogram), which is defined as the histogram without item labels. Previous works have provided algorithms with $\ell_1$- and $\ell_2^2$-errors of $O_\varepsilon(\sqrt{n})$ in the central model of differential privacy (DP).
In this work, we provide an algorithm with a nearly matching error guarantee of $\tilde{O}_\varepsilon(\sqrt{n})$ in the shuffle DP and pan-private models. Our algorithm is very simple: it just post-processes the discrete Laplace-noised histogram! Using this algorithm as a subroutine, we show applications in privately estimating symmetric properties of distributions such as entropy, support coverage, and support size.
Submitted 27 October, 2022;
originally announced October 2022.
-
Private Isotonic Regression
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi
Abstract:
In this paper, we consider the problem of differentially private (DP) algorithms for isotonic regression. For the most general problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function, we obtain a pure-DP algorithm that, given $n$ input points, has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X}) \cdot \log|\mathcal{X}| / n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset. In contrast, we also obtain a near-matching lower bound of roughly $(\mathrm{width}(\mathcal{X}) + \log |\mathcal{X}|) / n$, that holds even for approximate-DP algorithms. Moreover, we show that the above bounds are essentially the best that can be obtained without utilizing any further structure of the poset.
In the special case of a totally ordered set and for $\ell_1$ and $\ell_2^2$ losses, our algorithm can be implemented in near-linear running time; we also provide extensions of this algorithm to the problem of private isotonic regression with additional structural constraints on the output function.
Submitted 27 October, 2022;
originally announced October 2022.
-
Algorithms with More Granular Differential Privacy Guarantees
Authors:
Badih Ghazi,
Ravi Kumar,
Pasin Manurangsi,
Thomas Steinke
Abstract:
Differential privacy is often applied with a privacy parameter that is larger than the theory suggests is ideal; various informal justifications for tolerating large privacy parameters have been proposed. In this work, we consider partial differential privacy (DP), which allows quantifying the privacy guarantee on a per-attribute basis. In this framework, we study several basic data analysis and learning tasks, and design algorithms whose per-attribute privacy parameter is smaller than the best possible privacy parameter for the entire record of a person (i.e., all the attributes).
Submitted 8 September, 2022;
originally announced September 2022.
-
Cryptographic Hardness of Learning Halfspaces with Massart Noise
Authors:
Ilias Diakonikolas,
Daniel M. Kane,
Pasin Manurangsi,
Lisheng Ren
Abstract:
We study the complexity of PAC learning halfspaces in the presence of Massart noise. In this problem, we are given i.i.d. labeled examples $(\mathbf{x}, y) \in \mathbb{R}^N \times \{ \pm 1\}$, where the distribution of $\mathbf{x}$ is arbitrary and the label $y$ is a Massart corruption of $f(\mathbf{x})$, for an unknown halfspace $f: \mathbb{R}^N \to \{ \pm 1\}$, with flipping probability $η(\mathbf{x}) \leq η < 1/2$. The goal of the learner is to compute a hypothesis with small 0-1 error. Our main result is the first computational hardness result for this learning problem. Specifically, assuming the (widely believed) subexponential-time hardness of the Learning with Errors (LWE) problem, we show that no polynomial-time Massart halfspace learner can achieve error better than $Ω(η)$, even if the optimal 0-1 error is small, namely $\mathrm{OPT} = 2^{-\log^{c} (N)}$ for any universal constant $c \in (0, 1)$. Prior work had provided qualitatively similar evidence of hardness in the Statistical Query model. Our computational hardness result essentially resolves the polynomial PAC learnability of Massart halfspaces, by showing that known efficient learning algorithms for the problem are nearly best possible.
Submitted 28 July, 2022;
originally announced July 2022.
-
Faster Privacy Accounting via Evolving Discretization
Authors:
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi
Abstract:
We introduce a new algorithm for numerical composition of privacy random variables, useful for computing accurate differential privacy parameters for the composition of mechanisms. Our algorithm achieves a running time and memory usage of $\mathrm{polylog}(k)$ for the task of self-composing a mechanism, from a broad class of mechanisms, $k$ times; this class includes, e.g., the sub-sampled Gaussian mechanism, which appears in the analysis of differentially private stochastic gradient descent. By comparison, recent work by Gopi et al. (NeurIPS 2021) has obtained a running time of $\widetilde{O}(\sqrt{k})$ for the same task. Our approach extends to the case of composing $k$ different mechanisms in the same class, improving upon their running time and memory usage from $\widetilde{O}(k^{1.5})$ to $\widetilde{O}(k)$.
Submitted 10 July, 2022;
originally announced July 2022.
-
Connect the Dots: Tighter Discrete Approximations of Privacy Loss Distributions
Authors:
Vadym Doroshenko,
Badih Ghazi,
Pritish Kamath,
Ravi Kumar,
Pasin Manurangsi
Abstract:
The privacy loss distribution (PLD) provides a tight characterization of the privacy loss of a mechanism in the context of differential privacy (DP). Recent work has shown that PLD-based accounting allows for tighter $(\varepsilon, δ)$-DP guarantees for many popular mechanisms compared to other known methods. A key question in PLD-based accounting is how to approximate any (potentially continuous) PLD with a PLD over any specified discrete support.
We present a novel approach to this problem. Our approach supports both pessimistic estimation, which overestimates the hockey-stick divergence (i.e., $δ$) for any value of $\varepsilon$, and optimistic estimation, which underestimates the hockey-stick divergence. Moreover, we show that our pessimistic estimate is the best possible among all pessimistic estimates. Experimental evaluation shows that our approach can work with much larger discretization intervals while keeping a similar error bound compared to previous approaches and yet give a better approximation than existing methods.
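As a minimal illustration of the quantities named in this abstract (and not the approach of the paper itself), the hockey-stick divergence of a discrete PLD can be computed directly, and rounding every privacy loss up to a grid point gives a simple pessimistic estimate. The function names, the grid width, and the toy randomized-response distribution below are assumptions chosen for illustration.

```python
import math

def hockey_stick_delta(pld, eps):
    """delta(eps) for a discrete PLD given as {privacy_loss: probability}."""
    return sum(p * (1.0 - math.exp(eps - loss))
               for loss, p in pld.items() if loss > eps)

def round_up_pld(pld, width):
    """Naive pessimistic discretization: move each loss up to the next
    multiple of `width` (a simple baseline, not the paper's method)."""
    out = {}
    for loss, p in pld.items():
        bucket = math.ceil(loss / width) * width
        out[bucket] = out.get(bucket, 0.0) + p
    return out

# Toy PLD (hypothetical): binary randomized response with parameter eps0.
eps0 = 1.0
p_truth = math.exp(eps0) / (1.0 + math.exp(eps0))
pld = {eps0: p_truth, -eps0: 1.0 - p_truth}

# Discretize onto a grid of width 0.3; losses can only move upward.
coarse = round_up_pld(pld, width=0.3)
```

Because each term $p \cdot (1 - e^{\varepsilon - \ell})$ is nondecreasing in the loss $\ell$, `hockey_stick_delta(coarse, eps)` is never smaller than `hockey_stick_delta(pld, eps)` for any `eps`, which is what makes the rounded distribution a pessimistic estimate.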
Submitted 10 July, 2022;
originally announced July 2022.
-
Fixing Knockout Tournaments With Seeds
Authors:
Pasin Manurangsi,
Warut Suksompong
Abstract:
Knockout tournaments constitute a popular format for organizing sports competitions. While prior results have shown that it is often possible to manipulate a knockout tournament by fixing the bracket, these results ignore the prevalent aspect of player seeds, which can significantly constrain the chosen bracket. We show that certain structural conditions that guarantee that a player can win a knockout tournament without seeds are no longer sufficient in light of seed constraints. On the other hand, we prove that when the pairwise match outcomes are generated randomly, all players are still likely to be knockout winners under the same probability threshold with seeds as without seeds. In addition, we investigate the complexity of deciding whether a manipulation is possible when seeds are present.
Submitted 23 April, 2022;
originally announced April 2022.
-
Differentially Private All-Pairs Shortest Path Distances: Improved Algorithms and Lower Bounds
Authors:
Badih Ghazi,
Ravi Kumar,
Pasin Manurangsi,
Jelani Nelson
Abstract:
We study the problem of releasing the weights of all-pairs shortest paths in a weighted undirected graph with differential privacy (DP). In this setting, the underlying graph is fixed and two graphs are neighbors if their edge weights differ by at most $1$ in the $\ell_1$-distance. We give an $ε$-DP algorithm with additive error $\tilde{O}(n^{2/3} / ε)$ and an $(ε, δ)$-DP algorithm with additive error $\tilde{O}(\sqrt{n} / ε)$, where $n$ denotes the number of vertices. This positively answers a question of Sealfon (PODS'16), who asked whether an $o(n)$-error algorithm exists. We also show that an additive error of $Ω(n^{1/6})$ is necessary for any sufficiently small $ε, δ > 0$.
Finally, we consider a relaxed setting where a multiplicative approximation is allowed. We show that, with a multiplicative approximation factor $k$, the additive error can be reduced to $\tilde{O}\left(n^{1/2 + O(1/k)} / ε\right)$ in the $ε$-DP case and $\tilde{O}(n^{1/3 + O(1/k)} / ε)$ in the $(ε, δ)$-DP case, respectively.
Submitted 30 March, 2022;
originally announced March 2022.
-
Improved Approximation Algorithms and Lower Bounds for Search-Diversification Problems
Authors:
Amir Abboud,
Vincent Cohen-Addad,
Euiwoong Lee,
Pasin Manurangsi
Abstract:
We study several questions related to diversifying search results. We give improved approximation algorithms in each of the following problems, together with some lower bounds.
- We give a polynomial-time approximation scheme (PTAS) for a diversified search ranking problem [Bansal et al., ICALP 2010] whose objective is to minimize the discounted cumulative gain. Our PTAS runs in time $n^{2^{O(\log(1/ε)/ε)}} \cdot m^{O(1)}$ where $n$ denotes the number of elements in the databases. Complementing this, we show that no PTAS can run in time $f(ε) \cdot (nm)^{2^{o(1/ε)}}$ assuming Gap-ETH; therefore our running time is nearly tight. Both of our bounds answer open questions of Bansal et al.
- We next consider the Max-Sum Dispersion problem, whose objective is to select $k$ out of $n$ elements that maximize the dispersion, which is defined as the sum of the pairwise distances under a given metric. We give a quasipolynomial-time approximation scheme for the problem which runs in time $n^{O_ε(\log n)}$. This improves upon previously known polynomial-time algorithms with approximation ratio 0.5 [Hassin et al., Oper. Res. Lett. 1997; Borodin et al., ACM Trans. Algorithms 2017]. Furthermore, we observe that known reductions rule out approximation schemes that run in $n^{\tilde{o}_ε(\log n)}$ time assuming ETH.
- We consider a generalization of Max-Sum Dispersion called Max-Sum Diversification. In addition to the sum of pairwise distances, the objective includes another function $f$. For monotone submodular $f$, we give a quasipolynomial-time algorithm with approximation ratio arbitrarily close to $(1 - 1/e)$. This improves upon the best polynomial-time algorithm, which has approximation ratio $0.5$, by Borodin et al. Furthermore, the $(1 - 1/e)$ factor is tight, as achieving a better-than-$(1 - 1/e)$ approximation is NP-hard [Feige, J. ACM 1998].
Submitted 3 March, 2022;
originally announced March 2022.
-
Private Rank Aggregation in Central and Local Models
Authors:
Daniel Alabi,
Badih Ghazi,
Ravi Kumar,
Pasin Manurangsi
Abstract:
In social choice theory, (Kemeny) rank aggregation is a well-studied problem where the goal is to combine rankings from multiple voters into a single ranking on the same set of items. Since rankings can reveal preferences of voters (which a voter might like to keep private), it is important to aggregate preferences in such a way to preserve privacy. In this work, we present differentially private algorithms for rank aggregation in the pure and approximate settings along with distribution-independent utility upper and lower bounds. In addition to bounds in the central model, we also present utility bounds for the local model of differential privacy.
Submitted 29 December, 2021;
originally announced December 2021.
-
The Price of Justified Representation
Authors:
Edith Elkind,
Piotr Faliszewski,
Ayumi Igarashi,
Pasin Manurangsi,
Ulrike Schmidt-Kraepelin,
Warut Suksompong
Abstract:
In multiwinner approval voting, the goal is to select $k$-member committees based on voters' approval ballots. A well-studied concept of proportionality in this context is the justified representation (JR) axiom, which demands that no large cohesive group of voters remains unrepresented. However, the JR axiom may conflict with other desiderata, such as coverage (maximizing the number of voters who approve at least one committee member) or social welfare (maximizing the number of approvals obtained by committee members). In this work, we investigate the impact of imposing the JR axiom (as well as the more demanding EJR axiom) on social welfare and coverage. Our approach is threefold: we derive worst-case bounds on the loss of welfare/coverage that is caused by imposing JR, study the computational complexity of finding 'good' committees that provide JR (obtaining a hardness result, an approximation algorithm, and an exact algorithm for one-dimensional preferences), and examine this setting empirically on several synthetic datasets.
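For readers unfamiliar with the notions in this abstract, the following is a minimal sketch of how JR, coverage, and social welfare can be checked for a given committee, using the standard definition that JR fails when at least $n/k$ voters who share an approved candidate all have no approved candidate in the committee. The function names and the toy ballots are assumptions for illustration and do not come from the paper.

```python
def satisfies_jr(ballots, committee, k):
    """ballots: list of sets of approved candidates; committee: set of size k."""
    n = len(ballots)
    # Voters with no approved candidate in the committee.
    unrepresented = [b for b in ballots if not (b & committee)]
    candidates = set().union(*ballots) if ballots else set()
    for c in candidates:
        supporters = sum(1 for b in unrepresented if c in b)
        if supporters * k >= n:  # a cohesive group of size >= n/k is ignored
            return False
    return True

def coverage(ballots, committee):
    """Number of voters who approve at least one committee member."""
    return sum(1 for b in ballots if b & committee)

def welfare(ballots, committee):
    """Total number of approvals received by committee members."""
    return sum(len(b & committee) for b in ballots)

# Toy instance (hypothetical): 4 voters, committee size k = 2.
ballots = [{"a"}, {"a"}, {"b"}, {"c"}]
print(satisfies_jr(ballots, {"b", "c"}, 2))  # the two "a" voters are ignored
print(satisfies_jr(ballots, {"a", "b"}, 2))
```

In the toy instance, the committee `{"b", "c"}` violates JR because the two voters approving `a` form a group of size $n/k = 2$ with no representative, while `{"a", "b"}` satisfies it.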
Submitted 13 December, 2021; v1 submitted 11 December, 2021;
originally announced December 2021.
-
Private Robust Estimation by Stabilizing Convex Relaxations
Authors:
Pravesh K. Kothari,
Pasin Manurangsi,
Ameya Velingker
Abstract:
We give the first polynomial time and sample $(ε, δ)$-differentially private (DP) algorithm to estimate the mean, covariance and higher moments in the presence of a constant fraction of adversarial outliers. Our algorithm succeeds for families of distributions that satisfy two well-studied properties in prior works on robust estimation: certifiable subgaussianity of directional moments and certifiable hypercontractivity of degree 2 polynomials. Our recovery guarantees hold in the "right affine-invariant norms": Mahalanobis distance for mean, multiplicative spectral and relative Frobenius distance guarantees for covariance and injective norms for higher moments. Prior works obtained private robust algorithms for mean estimation of subgaussian distributions with bounded covariance. For covariance estimation, ours is the first efficient algorithm (even in the absence of outliers) that succeeds without any condition-number assumptions.
Our algorithms arise from a new framework that provides a general blueprint for modifying convex relaxations for robust estimation to satisfy strong worst-case stability guarantees in the appropriate parameter norms whenever the algorithms produce witnesses of correctness in their run. We verify such guarantees for a modification of standard sum-of-squares (SoS) semidefinite programming relaxations for robust estimation. Our privacy guarantees are obtained by combining stability guarantees with a new "estimate dependent" noise injection mechanism in which noise scales with the eigenvalues of the estimated covariance. We believe this framework will be useful more generally in obtaining DP counterparts of robust estimators.
Independently of our work, Ashtiani and Liaw [AL21] also obtained a polynomial time and sample private robust estimation algorithm for Gaussian distributions.
Submitted 7 December, 2021;
originally announced December 2021.